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ABSTRACT 


Markov Random Fields (MRFs) or undirected graphical models are parsimonious representa¬ 
tions of joint probability distributions. Variables correspond to nodes of a graph, with edges 
between nodes corresponding to conditional dependencies. For a pairwise MRF , the joint 
density factorizes as a product over edges of the graph. This thesis studies high-dimensional, 
continuous-valued pairwise MRFs. We are particularly interested in approximating pairwise 
densities whose logarithm belongs to a Sobolev space. For this problem we propose the 


method of exponential series Crain, 1974 Barron and Sheu, 1991 , which approximates the 
log density by a finite-dimensional exponential family with the number of sufficient statistics 
increasing with the sample size. 

We consider two approaches to estimating these models. The first is regularized max¬ 
imum likelihood. This involves optimizing the sum of the log-likelihood of the data and 
a sparsity-inducing regularizes We provide consistency and edge selection guarantees for 
this method. We then propose a variational approximation to the likelihood based on tree- 
reweighted, nonparametric message passing. This approximation allows for upper bounds 
on risk estimates, leverages parallelization and is scalable to densities on hundreds of nodes. 
We show how the regularized variational MLE may be estimated using a proximal gradient 
algorithm. We demonstrate our method’s efficacy in density estimation and model selection 
in comparison to other approaches in the literature using simulated data and MEG signal 
data. 

We then consider estimation using regularized score matching. This approach uses an 
alternative scoring rule to the log-likelihood, which obviates the need to compute the nor¬ 
malizing constant of the distribution. For general continuous-valued exponential families, we 
provide parameter and edge consistency results. As a special case we detail a new approach 
to sparse precision matrix estimation which has statistical performance competitive with the 


graphical lasso Yuan and Lin, 2007 and computational performance competitive with the 
state-of-the-art glasso algorithm |Friedman et ah, 20081. We then describe results for model 
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selection in the nonparametric pairwise model using exponential series. The regularized 
score matching problem is shown to be a convex program; we provide scalable algorithms 


based on consensus alternating direction method of multipliers 
and coordinate-wise descent. We use simulations to compare 
literature as well as the aforementioned TRW estimator. 


(ADMM, |Boyd et ah, 2011]) 


our method to others in the 
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CHAPTER 1 


INTRODUCTION 


Density estimation is one of the fundamental tools in statistics and machine learning Sil¬ 


verman 


1986 . Nonparametric density estimators are used for prediction, goodness-of-ht 


testing Fan, 1994 Bickel and Rosenblatt, 1973 , generative models for classification Fix 


and Hodges, 1989 John and Langley |1995 , inferring independence and conditional inde¬ 
pendence, estimating statistical functionals Beirlant et al.[ 1997 Poczos et al., 2012 


well as exploratory data analysis and visualization. In many fields including the biological 
and social sciences, information technology and machine learning, it is now commonplace to 
analyze large data sets with complex dependencies between many variables. This includes 
gene expression levels measured using microarrays, brain activity measurements from fMRI 
or MEG technology, prices of financial instruments, and the activity of individuals on social 
media web sites. The use of nonparametric density estimators is limited, however, by the 
curse of dimensionality: with even moderate sample size, high-dimensional space will invari¬ 
ably have large regions where the data is sparse, leading to uninformative predictions. As 
such nonparametric estimators typically have poor risk guarantees in high-dimensions. Fur¬ 
thermore, many methods suffer from a computational curse of dimensionality and heuristics 
or approximations to tune the models become necessary. 

In this work we consider nonparametric estimation of pairwise densities, a class of den¬ 
sities intimitely tied to undirected graphical models. The undirected graphical model or 
Markov Random Field [Jordan, 2004; Lauritzen, 1996] is a well-studied framework for rep¬ 
resenting joint dependence structures of random variables. An undirected graph G = (V, E ) 
consists of a vertex set V = {1,..., d} corresponding to the elements of the random vector 
X = (X\,.. ., X ( j), and an edge set E C V x V. Each edge e G E is an unordered pair of 
elements j,k EV, e = (j , k). For any subset A C V, we define the subset {A"^ : Xj, i e A}. 
Furthermore, for sets A, B, C we write X\ II Xq\Xq to mean Xj± and X q are independent 

conditional on X(j. The random vector X is Markov with respect to the graph G = (V, E) 
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if for every j, k G V, Xj II Xj„\ (Ay : l ^ j, k) if and only if (i, j) ^ E. 

The fundamental theorem of undirected graphical models is the Hammersley-Clifford 


theorem Dobruschin, 1968 , which states that if the density of A" p{x) is positive, then the 
following are equivalent: 


1. A" is Markov with respect to the graph G; 

2. The density of A", p(x ) can be factorized over the cliques of G: 


p( x ) = 11 V’c(zC'), 

Ced(G) 


where X(j = {x j : i G G}. 


( 1 . 1 ) 


This thesis considers nonparametric estimation of the pairwise graphical model for continuous¬ 
valued data, where the joint density can be further factored into a product of potential 
functions over edges: 


P( x )=Y[A( x i) 11 ^jk( x ji x k)- 
iGV (j,k)&E 


( 1 . 2 ) 


Pairwise graphical models have been used extensively in modeling discrete data. The Ising 


model Ising, 1925 for {0, l}-valued variables has the density 


p(x) = exp l 6 i x i + 6 jk x j x k ~ Z ( 6 ) S’ , 

(ieV (j,k)eE ) 

which can be seen as a pairwise graphical model with ^ = expand i[)jj, = exp {Oj^XjX^} 
and Z(9) a normalizing constant. The Ising model can be generalized to discrete variables 
with more than two levels, but there are a finite possibilities for pairwise discrete potentials 
with finite number of levels. 

The class of continuous-valued pairwise models is considerably more complex than dis¬ 
crete ones, as the potential functions could be any positive-valued functions such 
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that p integrates to 1. The class of continuous pairwise graphical models includes some 
familiar models, which we describe below. 


Example 1.0.1. Gaussian graphical model 

Let X G be a Gaussian-distributed random variable with mean p and covariance 
matrix E((A — p)(A — p) T ) = £ >~ 0. Denote D = ZD 1 . Then X has density 


p(x) 




p)~^Q(x 


oc 


n ex p 

ieV 


GjjXj + (Dp,) jXi 


x IJ exp | - ^QijXiXj 



From the factorization above, it can be seen that the Gaussian graphical model is not only 
a Markov random held, but also belongs to the pairwise class of densities. Following from 
, observing that the Gaussian density is positive over M^, two Gaussian variables XpXj 
are conditionally independent given the others if and only if (ZD 1 )^- = 0. 

Example 1.0.2. Gaussian copula graphical model 

Let A" be Gaussian distributed with mean p and covariance matrix Z, and suppose that for 
each % G V , Y{ = g{(Xj) where gi is some monotonic increasing, smooth function. Suppose 
further that the gi are centered and scaled so that E[^(Aj)] = E[A^] and var[cp(Aj] = 
var[Aj], Write / = p _1 . Denote the vector of functions / = (/l, /2, ■ ■ •, fd) and their 
derivatives by f' = (/{,/ 2 , ■ • •After applying a change of variables to the Gaussian 
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density, we find that Y has density 


p(y) = 


Q 


(2vr) d 


exp 


(f(y) - v) T n(f(y) - n) \ JJ f'iiVi 

ieV 


oc Yl fliVi) ex P 

ieV 

x J| exp 
(hi)e-B 


2 ^iifiiVi) T i^iMVi) 


2 ^ij fi(Vi)fj ( Vj ) 


Thus the Gaussian copula density is also a Markov random held and has a pairwise 
factorization. Following from (1.1), two variables Y % , Y) are conditionally independent given 
the others if and only if fly- = 0. Gaussian copulas have been used extensively in finance 


and risk management Cherubini et al., 2004 for their ability to model dependence between 
many variables which are (marginally) non-Gaussian. 


Example 1.0.3. Forests 

A tree T is an undirected graph where each pair of vertices is connected by exactly 
one simple path. Equivalently, a tree is a connected graph with no cycles. A forest is an 
undirected graph where each pair of vertices is connected by at no more than one simple 
path. Equivalently, a forest is a graph with no cycles. A spanning tree T of a connected 
graph G is a tree containing the vertices of G and a subset of the edges of G. A spanning 
forest of a graph G is a graph consisting of a spanning tree for each connected component 
of G. In the sequel will use the term spanning tree to refer to a spanning tree or forest 
unambiguously whether or not G is connected. 

A tree or forest must have cliques of size no more than two. Thus, from the Hammersley- 
Clifford theorem, a tree density has the factorization 


iev (i,j)eE 


(1.3) 


It follows that all tree distributions have pairwise densities. In particular, a tree can always 
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be factorized in the form 


P{x) = II Pi( x d 

ieV 


n 

(hj)€E 


Pij(Xj,Xj) 

Pi(xi)pj(xjY 


(1.4) 


where {pi} are the set of univariate densities and {pij} are the set of bivariate densities of 
the joint distribution p. 


1.1 Previous Work 


Little work has been done in studying the nonparametric estimation of pairwise densities. 


Gu, 2002, 1993 considered log-ANOVA density estimation, where the log-density can be 


factored into low-order terms, pairwise densities being a special case. They suggest an 
estimator for the log-density: 


mm 

V 


Li±^ )+loe j 

v k= 1 


e' + 


H 


(1.5) 


such that r] has a given pairwise factorization, where || • \\p[ is a norm in a reproducing 
kernel Hilbert space (RKHS); the resulting density estimate is proportional to e^ x \ Due 
to the representer theorem |Kimcldorf and Wahba, 1971 , this becomes a finite-dimensional 
optimization problem. However, due to the difficulty of computing log J e 71 it can only be 
used in low-dimensional problems, such as dimension up to 3. This work also assumes 


the ANOVA factorization structure is known. Jeon and Lin, 2006 proposed a smoothing 


spline estimator based on minimizing the Bregman score (4.8) for log-ANOVA densities, and 


applied it to undirected graphical model estimation. This solves the problem 


mm < — 
v I n 


+ [ v(y)p(y)dy + M\v\\h \, 


( 1 . 6 ) 


k =1 
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for some given density p; the resulting density estimate is proportional to p(x)e r) ^ x K The 
authors show that solving this problem only requires calculation of one-dimensional integrals, 
and is thus more scalable. This approach has some limitations. The issue of computing the 
normalizing constant remains; this is a problem for inference and for choosing the smoothing 
parameter A when using cross-validation to minimize the KL risk. Also, their procedure for 
graph selection is a heuristic. The theoretical properties of this estimator are not yet known. 


When the graph is assumed to be a forest (Example 1.0.3), density estimation and struc¬ 


ture learning was considered in Liu et al. 2011 . Due to the tree entropy factorization (2.19), 
learning a forest density can be done in two steps: estimating the univariate and bivariate 
marginals, and learning the graph structure. For the first task, they estimate marginals 
using kernel density estimation. For the second task, they use a nonparametric estimator of 


mutual information (2.11), and then estimate the maximum likelihood forest using Kruskal’s 
algorithm (Figure |3~4 ), with weights corresponding to mutual information between edges. 
They show consistency guarantees for their approach in terms of KL risk and graph selection. 

There has been a large amount of work on parametric pairwise models. For learning 
Gaussian models with sparse precision matrix, the graphical lasso, or the L \-regularized 


maximum likelihood is the most popular approach Yuan and Lin, 2007 ; this solves the 
problem 


min trace(DE) — log |D| + A||f2||i 


(1.7) 


E being the sample covariance matrix E = ^ Ylk= i(A^)(A^) t and ||G||i = Yhi j \^ij\ • The 
glasso algorithm solves the resulting problem using block-coordinate descent |Banerjee et al. 


2008; jFriedman et al., 2008 . The graphical lasso is known to have good properties in terms 
of parameter and structure selection consistency |Rothman et al. 2008 Ravikumar et al. 


2011 . There exist other approaches for sparse estimation of Q such as the the graphical 


Danzig selector Yuan, 2010 , and CLIME |Cai et ah, 2011 . The parallel lasso Meinshausen 
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and Biihlmann 2006 infers graph structure in a Gaussian graphical model without direct 


estimation of the covariance or precision matrix by running a sequence of neighborhood lasso 
fits in parallel. For a node i, they estimate the neighborhood of i by solving 


n 


0 l = argmin — V (X 
/3*=0 l 2n t^l 


P T X k ) + All/311! 


( 1 . 8 ) 


The corresponding neighborhood of i is the support of /3 l . They show consistency of neigh¬ 
borhood estimates, which may then be aggregated to form an edge set. The precision matrix 
can then be fit by estimating the Gaussian likelihood subject to sparsity constraints on the 


precision matrix. Liu et al. 2012a proposes the SKEPTIC estimator for structure learning 


of the semiparametric Gaussian copula model. The estimator plugs in a matrix of rank 


correlations (Kendall’s r or Spearman’s p ) into the graphical lasso (1.7); they show this 


estimator achieves the parametric rates for edge selection and parameter estimation. 

This thesis includes several contributions to the literature. In Chapter [2] we introduce 
the exponential series approximation to pairwise densities. We propose an estimator for 
pairwise densities based on regularized maximum likelihood estimation of a particular ex¬ 
ponential family whose sufficient statistics are basis elements. We use a method for edge 
selection using convex regularization, and provide risk and model selection guarantees in 
Section 2J3 While the exact problem is in general not tractable, in Chapter [3] we propose a 
convex variational upper bound on the likelihood based on a nonparametric tree-reweighted 


relaxation Wainwright et ah, 2005, 2003 , which can be computed efficiently and in parallel. 


Our method provides an upper bound on the normalizing constant, guaranteeing an upper 
bound on risk estimates. The approximation leads to a natural variational maximum like¬ 
lihood estimator, as well as an approach for marginalization. We train our method using 
a projected gradient algorithm, which can effortlessly be scaled to relatively sparse graphs 


on hundreds of nodes. In Section T6 we compare our method to several other approaches 
to large-scale density estimation, including the graphical lasso, mixtures of Gaussians with 
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the EM algorithm, and kernel forest density estimation. We demonstrate our method by 
estimating the graph from an MEG neuroimaging dataset. 


In Chapter 4 we consider a different approach to estimation and graph selection using 
an alternative scoring rule to the log-likelihood. It is based on minimizing the log-gradient 
between the model distribution and data distribution, or equivalently minimizing the Fisher 
divergence. This method obviates the need for computing a normalizing constant. We 
show that the optimization amounts to a second-order cone program, and provide two types 
of scalable algorithms specially tailored to the problem. Our method, which we denote 
QUASR for Quadratic Scoring and Regularization, produces parameter and graph selection 
consistency for general pairwise exponential families with only weak regularity conditions. 
En route, we derive a new method for sparse precision matrix estimation which performs 


competitively with the regularized MLE (1.7). Finally, we show how this approach produces 
graph selection guarantees for the pairwise nonparametric model when the sufficient statistics 
are basis elements. 


1.2 Notation and Preliminaries 


Throughout this thesis we assume we are given independent and identically distributed data 
X 1 , A" 2 ,..., X n , where X^ = (X^,..., Xj ), drawn from the density p(x) with respect to a 
reference measure v(x). In the context of density estimation, we assume the unknown log 
density / = logp belongs to the Sobolev space of functions on [0,1]^, W 2 , so that for any 
multi-index a with | a |< r, 


f M := XU 

dx i 1 •■ ■ dxj d 


(1.9) 


has bounded Z^z') norm: 


\\f {a) \\ := f I / (Q ° I 2 v{dx) < oo. 
X 
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( 1 . 10 ) 




This implies that p is bounded away from zero and infinity and that p(x) has bounded 
support. 


We use the asymptotic notations O(-), o(-), Q(-) and x. For two functions f(n),g(n), 
f = 0(g) if f/g < c for a constant c > 0 as n -> oo; f = o(g ) if f/g —» 0 as n —> oo; 
/ = fl(g) if / > eg for some c > 0 as n —> oo; / x g if / = eg for some c > 0 as n —> oo. We 
say that / = O p (g) if / = 0(g) with probability approaching one as n —> oo. 

Let 0 k'h k, l — 1, 2,...} be a tensor product basis for L2[0,l] 2 , so that (j)^ = 0£0h 
We suppose this basis is uniformly bounded and orthonormal. Consider the density p having 


a pairwise factorization 1.2, so that / = logp can be expressed with the basis expansion 


00 


oo 


f(x) = f 0 (x) + 0Q + 

i,jGV k,l =1 i&V k= 1 


( 1 . 11 ) 


exp /o is some base measure which has the same pairwise factorization as p. We will take 
/o = 0, but our results also apply whenever /q has the same smoothness assumptions as /. 
9, cj) represent the parameters and sufficient statistics vectorized. 9 e , 6 V denote the vectors of 
edge and vertex parameters, respectively. For a / G IF 2 , for any i,j G V and r ? ; + rj = r, 
we additionally have that 


= Cl < OO, (1.12) 

k 

^2(e*f l ) 2 k 2r n 2rj = c 2 < oo. (i.i3) 

k,l 


This implies that (9*)% = o(/c _r ~ 1//2 ) and (9*)^j = o(k ~ Ti ~ l / 2 ). 

We denote II to be the set of densities on [0,1]0 For risk analysis in Chapter 2 we use 
the relative entropy, also known as the Kullback-Leibler divergence: 


KL(p | p) 


f P ^ bg =Ep 

if 



(1.14) 
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It can be shown that KL(p | p) > 0 and equals zero only if p = p //-almost everywhere. 
Relative entropy is a natural risk measure for density estimation. It is invariant to invertible 
changes of variables, and shares a natural connection to maximum likelihood. Convergence 
in KL is strong in the sense that it implies convergence in several other risk measures. In 
particular, define the L\, Hellinger and Total Variation distances as follows: 


D\{p | p) 
Dh(p I P) 
Dtv(p I P) 


I I p( x ) ~ p( x ) I dv(x), 

X 

J \p(x) 1 / 2 — p(x) l ^ 2 \ 2 du(x) : 

X 


sup 

AeX 


S p(x)iv{x ) 

A 


jp(x)du(x) , 
A 


(1.15) 

(1.16) 
(1.17) 


Reiss, 1989 Kullback, 1967 , we have 


By Pinsker’s inequality and 

KL {p | p) > 
KL {p | p) > 
KL {p | p) > 


DliP 1 P) 

(1.18) 

2 

Dh(p 1 pf, 

(1.19) 

2 D TV {p \p ) 2 - 

(1.20) 
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CHAPTER 2 


EXPONENTIAL FAMILIES AND THE EXPONENTIAL 


SERIES REGULARIZED MLE 


2.1 Exponential Series Approximation 


Consider an approximation to / = logp (1.11) by truncating the basis expansion in the 
following way: 


d mi 


W2 m 2 


log Pq(x) = EE e i^i) + E t 2 - 1 ) 

i,j£V k =1 Z=1 


i =1 fc=l 


This approximation is an exponential family with sufficient statistics corresponding to 
the basis functions {(f>k(xi),i G V, k < mi} and {(p^x^cpi^Xj)^, j G V, k,l < m 2 }. Z{9 ) is 
chosen so the density integrates to one. The idea of representing a density as an exponential 


expansion was first used for goodness of fit testing in Neyman, 1937 . Nonparametric es¬ 


timation of univariate distributions using exponential series has been studied previously in 
Crain 1974, 1977 Barron and Sheu 1991|. For simplicity we assume that each univariate 


component is truncated after rri\ terms and bivariate components after m 2 terms. In prac¬ 
tice we could attempt to vary truncation for each variable, though this could be unwieldy 
for large problems. When not ambiguous we will write log pg(x) = (9,(p(x)) — Z(9). Given 
n i.i.d. samples ... ,X n , the exponential series MLE estimator finds the regularized 
maximum likelihood estimator of 9: 


9 ■= argmin < —- ^ \ogpg(X r ) + \1Z{9) > , 


r= 1 


= argmin | — C{9) + A77(6*)|. 


( 2 . 2 ) 

(2.3) 


77 is a convex regularize^ we discuss our particular choice of regularization in Section |2.5.1 

The resulting density estimate is Pq{x). The fact that 9 depends on the truncation parameters 
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mi, m 2 and regularization parameter A is left implicit. 

Exponential series is a natural approach for the estimation of pairwise densities. The 
product factorization of pairwise densities can be expressed naturally by exponential series. 
Indeed, many common parametric graphical models are exponential families. Each trun¬ 
cated series forms an exponential family. We detail exponential families and the regularized 
maximum likelihood in the sequel. 


2.2 Exponential Families 

This section presents background on exponential families and their use in graphical modeling. 
For an in-depth treatment, see Brown[ 1986 Wainwright and Jordan, 2008 . 


Consider the random vector X = (Xp...,JQ) taking values on the support X = 
®f = l X t . The exponential family with sufficient statistics <f(x) = (<fi(x), (t>2{ x )i • • •, 4>m { x )) 


is the family of probability distributions 


| P8 ■ Pd( x ) = exp j(0,0(x)> - Z(9) j,0 G pj, (2.4) 

where (9, <f>{x)) = Y^aLi @af>a( x )• Z{9) is called the log-partition function , and is given by 

^(0) = log^exp {(0,0)}z/(ote), (2.5) 

and ensures the density pg(x) integrates to one. The family is indexed by the parameters 9 , 
called the natural parameters belonging to the space V = {9 : Z(9) < oo}. 

An exponential family is minimal if there is no choice of parameters 9^0 such that 
(i 9,<f)(X )) = C zz-a.e., where C is a constant. If a family is minimal, there is a bijection 
between the natural parameter space V and the densities belonging to the family. An 
exponential family is regular if V is an open set. 

Example 2.2.1 (Gaussian exponential family). Consider the Gaussian density with mean /i 
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and covariance £ >- 0. The Gaussian family is an exponential family with sufficient statistics 
{x, xx T } and natural parameters (9\, 9 2 ) = {O/x, — ^fl}. Since £ >~ 0, the natural parameter 
space is 

v = {{6i, e 2 ) e R d x R dxd : 9 2 ~< 0}. 


The space of negative definite matrices is an open convex set, so it follows that V is convex 
and the Gaussian family is regular. Furthermore, linear independence of monomials implies 
that x and xx ^ are linearly independent, and so the Gaussian family is also minimal. 


Remark 2.2.1 (Exponential series). Consider the exponential series family of (2.1). When 
{(} i s an orthogonal basis which satisfies the Haar condition, the exponential series 
family corresponds to a minimal exponential family. Furthermore, since the exponential 
series are defined have compact support [0,1]^ and the sufficient statistics (j) are bounded 
above and below, any choice of 9 6 R^ d will produce a valid Z{9) < 00 ; in other words 
V = which is a convex and open set, so the exponential series family is regular. 


2.3 Mean Parametrization 


An exponential family is parametrized by its so-called natural parameters 9 (2.4). Alterna¬ 


tively, an exponential family may also be characterized by a vector of mean parameters p. 


The connection between these seemingly disparate entities will be described in Lemma [2. 3.1 
Let p be any density on X with respect to v. We define the mean parameter fi a corre¬ 
sponding to the sufficient statistic (j) a by 


Ra= J p(x)<j>a(x)dl/(x) = Epical- (2-6) 

Consider the set of vectors that correspond to the moments of some distribution: A4 = {/1 : 
3 p 6 LI, p a = Ep[0 a ], Va = 1,..., M }. In particular, the elements of M. need not correspond 
to mean parameters of an exponential family. Furthermore, A4 is a convex set. To see this, 
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let fii, H 2 € Ai be mean parameters corresponding to two distributions pi,P 2 with respect 
to v. Then 

Apt + (1 — A)p2 — ®Api+(l_A)p2 


For discrete variables, one can further show that Ai is a convex polytope Wainwright and 


Jordan, 2008 . This fact is exploited in many inference algorithms for discrete graphical 


models, but this is not true for continuous random variables. Characterizing Ai for general 
continuous sufficient statistics is in general very challenging. It is closely connected to the 
so-called moment problem [Landau,[1987 which has been studied since the late 19th century. 
For polynomial sufficient statistics, Ai can be characterized by a sequence of semidehnite 


constraints on the moments Lasserre, 2009 . This is suggested by the positive semidehnite 
constraint on the covariance matrix E for Gaussian densities. 

We now state several important facts relating the natural parameters 9 and the mean 


parameters //. For proofs, see Wainwright and Jordan, 2008 


Lemma 2.3.1. Suppose 9 corresponds to the natural parameters of an exponential family 
with sufficient statistics (j>{x) and corresponding mean vector p,{6) = E pe [(p\. Let Z(9) be the 
corresponding log-partition function, and define its gradient VZ(9) : V —> Ai. The following 
hold: 


1. 9 and /j are related by the mapping 


VZ(9) = ii(9)-, 


(2.7) 


2. V 2 Z(9) =cov 0 [</>]; 

3. Z(9) is a convex function, and strictly so if the family is minimal, so that J T (cov5/[0])J > 
0 for each 6 ^ 0 ; 

4. The mapping VZ(9) : V —> Ai is one-to-one if and only if the family is minimal; 
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5. The mapping \7Z(9) is onto the interior of At, int(At) if the family is minimal. 


Remark 2.3.2. The exponential series family is minimal when the orthogonal series satisfies 


the Haar condition , so its likelihood is strictly convex, and (2.3) is a convex problem so long 
as the regularizer 1Z is convex. 


2.4 Duality 


For any function Z taking values 9 G V, we define the Fenchel conjugate , Z*, as follows: 


Z*(ij.) = sup eev l(0,n) - Z(6)}, 


( 2 . 8 ) 


When Z corresponds to the log-partition function this equation bears a strong resemblance 


to the maximum of the log-likelihood (2.3). Indeed when p corresponds to empirical mean 
parameters p = -g J2k =i 0(A" fc ) if is precisely that, though (2.8) is well-defined when p 
doesn’t correspond to a p G At. 

For /i G intAt corresponding to a minimal family, let d(p) denote the unique natural 
parameters corresponding to p. Denote 


h (p) ■■= H(p e(/u) ) = -E Pfl(/x) [log p e{fl) ] (2.9) 


the entropy of the density Pg( fl y Furthermore, denote the univariate entropy by 


H iM ■= - E P»( rt [ lo g!>i,8(M) (V)], 


( 2 . 10 ) 


and bivariate mutual information 


hj (aO 


log 


Pij9(/j,) i Xj ) 

Pi,9(g) i-^-i)Pj,9(fj,) (^-j) 


( 2 . 11 ) 
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Theorem 2.4.1. The Fenchel conjugate of the log-partition function Z is given by 

{ —H(p), p G mtXt: 

( 2 . 12 ) 

oo, p M ; 

for p G M\intM., Z*{p) is given by the limit of Z*(p^) for any sequence {p^}, p^' G intXi. 

Example 2.4.1 (Gaussian Entropy). Let A" ~ 1V(0, E), E >~ 0, and 12 = E -1 . Denote p the 
density of X. Then 


-Ep[logp] = ^ log(2vr) - ^ log |fi| + ^E p [X J nX}. (2.13) 

Now, A^ T 12A^ = trace(12AGE T ), and since the trace is a linear operator, we may move 
the expectation inside the trace, giving 

E p [X T 12X] = trace (l2E p [XX T ]) 

= trace (12 E) 

= traee(J^) = d. 

Thus the Gaussian entropy is 

^ (1 + l°g(2vr)) — ^ log 1121. (2.17) 

Example 2.4.2 (Tree Entropy and Maximum Likelihood Trees). Let 6 be the parameters of 
a minimal exponential family which is tree-structured: that is, any edge parameters 9j.j = 0 
for (i,j) fL T, where T is the edge set for a tree. Because of the tree density factorization 


(2.14) 

(2.15) 

(2.16) 
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(1-4), 


H W)) = - E [!og Pofr)] (2.18) 

= Y,HMO))- E hM e ))- (2- 19 ) 

iev (*j)er 

Thus, for a tree-factored distribution has a simple expression for its entropy in terms of the 
univariate entropies and bivariate mutual informations. 

2.5 Main Results 

2.5.1 Sparsity 

For our risk analysis, make the sparsity assumption on 9*, that 9* G V, where 

V = V(E) := 1 9 : 11 9{j 11 2 = 0, V(i, j) £e\<ZV. (2.20) 

However, the set E is unknown. Recall that for the pairwise graphical model, (i,j) £E 
when 9'jt = 0 for each k,l = 1,... , oo; in other words, when \\9 1 ^ ||2 = 0. For clarity, we refer 
to 9*, [ly to be the vectors of vertex parameters, 9 ei y e to be the vectors of edge parameters, 
and 9ij,yij to be the vector of parameters corresponding to edge To encourage edge 

sparsity we consider the penalty 


m)= E ii®ijii2- I 2 - 21 ) 

1Z defines a norm over the edge parameters. Furthermore, 1Z has some special properties 
which we detail below. 
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Proposition 2.5.1. The dual norm of 7Z is 


TZ*{6 e ) = max ||6y| 2 . 
(id)€E 


( 2 . 22 ) 


TZ is known as the (l,2)-group penalty, and its dual the (oo,2)-group penalty. In our 
application, the groups correspond to parameters of given edges. Group penalties are best 


known from their use in the group lasso Yuan and Lin, 2006 , which is used to encourage 


group sparsity in regression coefficients. For a vector of parameters 6 denote its projection 
onto V by dp = j)eE W^ijlh an(4 its projection onto its orthogonal complement by 
= Z(ij)eEc 11 @ij 11 ■ 

Proposition 2.5.2. TZ is decomposable with respect to V. That is, 


1Z(5 + d) — 7Z(S) + TZ(9), 


(2.23) 


for each S G V 1 - and 6 eV. 

The following proposition characterizes the subspace compatibility constant for 1Z , which 
is necessary in the proofs. 

Proposition 2.5.3. 


sup < VWV ( 2 - 24 ) 

UGV\{ 0 } 


We will state our main theoretical result for the regularized exponential series MLE. A 
full derivation of the results are in Appendix |A} 

We start with three assumptions: 


Assumption 2.5.4. Haar Condition-. Any truncated collection of basis elements{d} is 
linearly independent v — a.e.. 
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Assumption 2.5.5. The univariate basis functions satisfy for each k, |0^.| < b[k) = 0{k a ) 
for some a > 0. 

Assumption 2.5.6. For each i,j E V, 

e<pij(xi,xj)<e, (2.25) 


for absolute constants e > 0, e < oo. 


These assumptions are mild. Many bases satisfy assumptions (1) and (2), such as the 
standard polynomial or trigonometric bases. For the orthonormal Legendre basis, < 

with a = \. The use of an overcomplete basis 


V2k + 1 , so it satisfies Assumption 


2.5.5 


creates some statistical difficulties as the resulting truncated exponential family is no longer 


minimal. Assumption 2.5.6 is mild for density estimation as we only require boundedness of 
the bivariate marginals of p rather than of p itself. 


The natural first question is whether the problem (2.3) has a solution at all, and if so, 


how many solutions. We begin by showing the existence and uniqueness of (2.3). 


Lemma 2.5.7. Suppose n > m\. The solution (2.3) exists and is unique with probability 


one. 


2.5.2 Risk Consistency 

We now present consistency of p^ in terms of the KL risk. 

Theorem 2.5.8. Suppose that the regularization parameter is chosen to be 


m 2+4a \ 0 g^ m2( pj 


'n 


n 


(2.26) 
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and the truncation parameters satisfy 


mi = Q 2 ) , 

m 2 = £7 |i£| r-a- 1/2 


(2.27) 

(2.28) 


Then the regularized exponential series MLE 9\ satisfies 


KL(p | pz) = O p m 2 2/ | E\+m 1 Ar d + 


— 2 r j , m| +4a log(dm 2 ) , ^ , mi 


n 


n 


(2.29) 


Corollary 2.5.9. T7ie optimal choice of truncation dimensions m\, m 2 is 


m\ x max < n 2r + 1 , d r ~ a ~ C 2 > , 


m 2 x max < n 2r + 2 + 4a , |i7| r_Q_1 / 2 


Consider typical choices of r = 2, a = \ (such as the Legendre basis). 


The dimension d and edge cardinality \E\ may scale as 


d = o{y/n), 
\E\= o(n5/log(n)), 


(2.30) 

(2.31) 


with the risk still approaching zero as n —> 00 . 

• Suppose that d = O(nS) and \E\ — 0(ns). Then by choosing 

1 

mi x n 5, 

1 

m 2 x , 

log 1 / 2 (rad) 

A X y , 

n 4 


(2.32) 

(2.33) 

(2.34) 
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the risk decreases as 


KL (P I Pn) = O p 


\E\ log(dn) d 


n 2 


+ T 

715 


(2.35) 


Remark 2.5.10. The result of 2.5.8 holds uniformly for any set B of pairwise densities p with 
bounded Sobolev norm. In particular, assuming \ E \— o(n 4 / 2 ) and d = o(n 4 / 5 ), 


Remark 2.5.11. Theorem 


lim lim sup P (~KL{jp \ p) > I I dn ^ — q. (2.36) 

t->oc n->oo peB \ „2 n5 ) 


2.5. 


shows that the risk of p^ adapts to the unknown sparsity of 
6* , in that the risk contains a factor of \E\ rather than d 2 . However, this is not sufficient for 
model selection consistency, which requires further assumptions. In particular consistency 
in KL risk does not require an incoherence condition. We consider model selection in the 
next section. 


2.5.3 Model Selection 

Our result for model selection consistency requires more stringent assumptions, in addition 
to those in the previous section. We denote the vector of truncated parameters by 9*. We 
index the (infinite) vector of omitted parameters by T, so that the vector of parameters are 
9j,. We use the subscript E to denote the collection of parameters of edges in E (as well as 
all vertex parameters), and E c to denote parameters for edges in E c . Denote the covariance 
matrix of the sufficient statistics (ft by 


T := co v p [(ft]. (2.37) 

For two index sets A,B denote Tab to be the cross-covariance between (ft a an d (ftsi 

cov[(ftA ,0b]- 
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Assumption 2.5.12. For a constant < oo, 


I r ^ll2 - 


«r 


Vd + W\ 


m 


: = IIT^tIIoo- 


(2.38) 

(2.39) 


where UAUoo = mah ere denotes the matrix oo norm and ||v4 ||2 the matrix 
operator norm. Additionally, we define the following: 




(2.40) 


where 


U = (life - E„ t fe]||l + II 4>T - E <?t [flV]|| 1 ) 2 , (2.41) 

and 9 1 := 9* + z(9 — 9 *) and z G [0,1]. 

Assumption 2.5.13. For some kr < 00 , for all z G [0,1] and all 9 satisfying 


9 E c = 0, (2.42) 

11^1; — #eI|oo < 2 ky (|| He ~ f^E 11 00 + A n /m + {kr + l)||0j’||oo) 1 (2.43) 

we have that 

11 % II 00 < kr max{ 6 (m 2 ) 2 , b{m\)}. (2.44) 

This assumption may appear opaque so we will elaborate. If pq is bounded, the third 
central moment of the univariate statistic is 
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(2.45) 

(2.46) 


E eUk ~ ^elh)) 3 } < KmeU k - ^elh)) 2 } 

< eb{k) j <j>\ 

< eb(k). (2.47) 

This holds similarly for bivariate sufficient statistics. Kq\ is a sum of a third central 
moment of a sufficient statistic and the third cross-moments of that statistic with the other 
sufficient statistics. We thus require that the third cross central moments between sufficient 
statistics decay sufficiently rapidly, so that the sum is on the order as stated. In our proof, 
this factors in to the remainder term , which is the bias from truncating the infinite expansion 
of the log density. Finally, we have an irrepresentable condition 

Assumption 2.5.14. 


^J T ^ T EEh<^=, 


for some r G (0,1]. 


(2.48) 


Here ||A ||2 is the matrix operator norm. This condition is reminiscent of the irrepre¬ 


sentable condition for sparse additive models in Ravikumar et al. 2009 , in that it involves 


the operator norm rather than the matrix oo norm. It guarantees that no sets of variables 
in E and E c are too strongly influenced. In the following theorem we assume kTi k Ti k R 
grow as constants, though they are tracked in the supplementary lemmas. Finally we define 
p* = mm W)gB Halloo to be the minimum oo norm of the edge parameters. 


Theorem 2.5.15. Denote E to be the edge set learned from 9\ ; E := {( i,j ) : = 0}. 
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If the truncation dimensions mi, m 2 and regularization parameter X n satisfy 


1 


m 2 X n 2r+4a+l ; 

1 

mi X n2r+4a+i ; 


log(nd) 

2r—1 
2r+4o;H-l 


and suppose that the number of variables d and p* satisfy 


d = o I e 

( 


2r—l 

^2r+4aH-l 


— = O 

p* 


V 


2r+l 


2t’+1+4q 

\ log (rid) 


then 


P (E = E)->1. 


(2.49) 

(2.50) 

(2.51) 


(2.52) 

(2.53) 


(2.54) 


2.6 Discussion 


Proofs and supporting lemmas for this chapter may be found in Chapter 5, but we will briefly 
discuss the results here. In Barron and Sheu 1991), the optimal choice of truncation for 


1 


univariate exponential series approximation was mi x n 2r + 1 , and for the bivariate problem 
1 

we have m 2 x n 2r + 2 . Our truncation is of a lower order. In our proofs, in order for our 
estimator to adapt to the unknown sparsity of E we require exponential concentration for 
the sufficient statistics. To do this, we use Hoeffding’s inequality, which gives that 


Hj 


PijW ~ 



(2.55) 
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Barron and Sheu 


19911 use Chebyshev’s inequality, gives a tighter bound of y but 


doesn’t give exponential concentration. 

Our model selection results require conditions on the covariance matrix of the sufficient 
statistics T, particularly an incoherence condition. This is natural; for example, for Gaussian 
graphical model selection, the same incoherence condition is required on the covariance 
matrix of the sufficient statistics; in this application the covariance has the simple expression 

For the typical choices of r — 2 and a — the dimension 


Ravikumar et ah 


2011 


may scale nearly exponentially with the sample size, 


d = o ( e 


3 

,71 7 


(2.56) 


and p* may scale as 


n 


5/14 


— = o 

p * 


log 1 / 2 (nd) 


(2.57) 


with E = E with probability approaching one. Observe that the KL risk of may diverge 
rapidly while still having the correct sparsity pattern with high probability. For the para¬ 
metric Gaussian graphical model the optimal rate is o(e n ), jRavikumar et ah, 2011 which 
is also the optimal rate for forests Liu et al. 2012b]. The optimal choice of m 2 for model 
selection is larger than that for the risk analysis, so by oversmoothing the edge potentials, 
we get better sample complexity for the model selection problem. This phenomenon was 
also found for graph selection for forests Liu et al. 2012b |. 
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CHAPTER 3 


TREE-REWEIGHTED VARIATIONAL LIKELIHOOD 

APPROXIMATION 


First-order optimization procedures for solving the regularized maximum likelihood problem 
requires evaluation of Z{9) and its gradient VZ(9) = f x pg(x)(j)(x)di>(x) . For a pairwise 
density on d nodes, these computations still require a d-dimensional integral even when the 
graph is sparse. Using the junction tree algorithm |Kollcr and Friedman, 2009 , it is possible 
to factorize the joint density into terms which have no more variables than the treewidth of 
the graph, making these calculations simpler, ffowever, the treewidth of a graph may in 
general be large, and we are interested in procedures which work for general graphs. 

Monte Carlo methods are one popular approach for approximating partition functions 


Gilbert and Nocedal, 1992 . ffowever, it may take a very long time for suitable conver¬ 


gence, and such methods generally don’t provide finite-time bounds on the accuracy of the 
approximation. 

ffere we pursue a tree-reweighted variational approach |Wainwright et ah, 2005 , which 
replaces Z{6) by a surrogate Q{9). Our method has several key advantages: 


1. ft computes an approximation to Z{9) and its gradient VZ(9) all in one pass using a 
parallelizable message passing algorithm; 

2. Q(9) is guaranteed to be an upper bound on Z{6) for each 9, with the tightness dictated 
by variational parameters; 

3. ft is based on convex optimization of the variational parameters, so there are deter¬ 
ministic stopping criteria for computing Q(9); 

4. Q(9), like Z(9), will be strictly convex in 9, so there are criterion for global convergence 
of the approximate maximum likelihood. 
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3.1 Problem Formulation 


Recall that a density with a tree graph can be factorized in the form (1.4). A density 
corresponding to a graph G with cycles cannot in general be factorized. Instead we will 
consider the collection T of spanning trees of G. Consider an exponential family following 
the graph G having natural parameters 9. For a spanning tree T e T, consider the vector 0 T 
that obeys T: Ofj = 0 if (i,j) 0 T; in shorthand we write 6j- to be the vector of parameters 
corresponding to edge 

Now consider writing the parameter value 6 as a convex combination of spanning tree 
parameters: 


0 = Y, a T 6 T , 

TgT 

by the convexity of the log-partition function Z ( 9 ), we have 


Z{6) < a T z (0 
TeT 


(3.1) 


Now, we may form the tightest upper bound on the log-partition function by solving 


Q( «) : = min < a T z (f 
I 0 Irer I TeT 


(3.2) 


s.t. 


9 = 'y ^ a r p9 
T 


T 


Observe that since Z is a convex function, (3.2) is convex in {9 T }t^Ti anc ^ ^ ie constraints 
arc linear, so the problem is convex. However, the number of spanning trees of a general loopy 


graph G could be very large, perhaps even super-exponential in the number of edges Cayley 


1889 , so the number of parameters to minimize over is in general very large. Contrary to 


expectation, it is possible to efficiently solve this problem. To begin, we will look at the dual 
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problem to (3.2). 


3.2 Solution to the Dual Problem 

For an edge write 

a ij = a TH(hj) e T }- (3-3) 

TeT 

This is the edge appearance probability, the probability edge (i,j) is observed when drawing a 
spanning tree at random according to the distribution {ay}. The set of such edge appearance 
probability vectors { aij} which can be written as convex combinations of tree indicator 
vectors, is known as the spanning tree polytope , which we denote Sy. Let 


n = / Qij( x i,Xj)dxj = qi(xi), / q iq = 1, 


(3.4) 


X; 


Xi x Xj 


Qi = 1 ,Qij > 0 ,q t > 0 e E y 


Xi 


M = <ii:3{q l ,q l j}eIl:E qi [(j) i \=yi,E qij [(j) i: i}=ii i j,(i,j)eE [ >. (3.5) 


That is, fl is the set of univariate and bivariate densities over E which respect marginaliza¬ 
tion, and A4 is the set of mean parameters which can arise from elements of If. To derive 


the dual problem to (3.2), we first define the Lagrangian 


e) = ^2 arZ y) + rT {0~ *22 a T® T ) (3-6) 

TeT TeT 

= M>- 22 a A(E0 T )-z(e T )) (3.7) 

TeT 

To minimize L with respect to 0 T for a given T € T, we set the associated derivative to 
): = 0. Denoting the optimum by dj, the solution may be written in terms of the 


zero: 




Fenchel conjugate of Z (section |2.4[) : 


z t( t ) := max \ ( T > ° T > ~ ) 

o 1 


T\ 


= {t ,ej)-z(6i) 


T\ 


(3.8) 


Recall that the dual of Z for a tree-structured parametrization (2.19) takes the form 


Zf(r) = ^tf,(T)- £ /y(r), 

i£V (ij)eT 


r e M, 


(3.9) 


and so the dual to (3.2) is 


max < (0, t) + I] H i ( z) ~ J2 

teAA. \ ( m '\/-T7 ' 

lEV (i,j)eE 


(3.10) 


For some distributions, such as discrete pairwise models and Gaussian models, the entropy 


and mutual information have a closed form and (3.10) can be solved explicitly. Unfortunately, 


for continuous models there is typically no such expression. In the following section we show 


that (3.10) is equivalent to a functional optimization problem, which we solve using message 


passing. 


3.3 Functional Message Passing 


Observe that the dual problem finds an optimum over the space of mean parameters realizable 


by distributions in II, so (3.10) is equivalent to the following functional optimization: 


ma ~ x i Yl ( ( 0 b E gJ<Aj]> ~ E gJ lo g%] 
gen l ^ 


+ 'y ^ ( {®iji E gp [fiij]) a ij^qij 

(hj)&E 


log 


Qij 

mj 


(3.11) 
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If q *, q*- are solutions to (3.11), the solution to the dual problem (3.10) is given by 

L LJ - 


T*(«) = E,.[0(], 

T m 


i 6 V, (3.12) 


C i,j) 6 E. (3.13) 


The optimization (3.11) is an optimization of a convex functional over a space of linear func¬ 
tional constraints II. Any solution to the stationary conditions of the associated Lagrangian 
will thus correspond to a global optimum. We may derive the stationary conditions using 
standard arguments from calculus of variations. Write the constraints as 


Ci (ft) 


1 - 


ft ( x i) dx i 


Cij { x ji Qij) 
Cji{{ x i, qij) 



(3.14) 

(3.15) 

(3.16) 


and let r/j, r^j (xj) be the Lagrange multipliers associated with these constraints. 

The second and third multipliers are real-valued functions. This gives us the stationary 
conditions 


log ft ( x i) 


a ij log 


Qij ( x i > 


Xa 


Qi{xi)qj{xj) 


(0i, M x i)) + dri + Vh 
r&N(i) 

{Qij, 4>iji x i, x j)) ~ Vji ( x i ) - 'Qij { x j) , 


we may simplify the second condition to get 


(3.17) 

(3.18) 


log qij ( X{,Xj ) = ru + ijj + (9ij,<j>ij(xi,Xj))/atij 
+ (9i,M x i)) + ( 9 j^j( x j)) 

+ 'y ] Qri { x i ) / a ij T y ] Qrj ( x j) / a ij- (3.19) 
r£N(i)\j r&N(j)\i 
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Figure 3.1: Bivariate pseudo densities under increasing regularization. Left plot is unreg¬ 
ularized; right plot is fully regularized (dimensions are independent). Simulated data is a 
mixture of three spherical Gaussians. 


For each (i,j) G E , we define a message My : Xj —> M, and Mji : —> R, by 

(3.20) 

(3.21) 


Mij (xj ) = e m ^ x ^V ai X 

Mji (xj) = e v ^ Xi ^ ai ^ 


so that the solution to (3.11) takes the form 


Qifa) oc exp{(0*,^)} TT M J{ (xi)M' , (3.22) 

j&N(i) 

q i j{x il x j) oc exp{(0y,0y)/oiy + (d^fa) + (9j,<f>j)} (3.23) 

^11 reN(i)\j M r i( x i) ”11 reN(j)\i^rj { x j) ^ 

M ji {x i ) l - a ^M i j(x j ) l - a * 

These are pseudodensities : they are valid densities which obey the marginalization con- 
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1. Initialize messages < >; 

L J J L ) 

2. For n — 0,1,.. 

Update beliefs: 

b™ +l (xi) oc exp (0*, <j>i) JJ Mfy (; Xif ri ; (3.25) 

r&N(i) 

Update messages: 

M if l ( x j) j exp{(0y, <tHj)/(Hj} | | dx i- ( 3 - 26 ) 


Figure 3.2: Functional Belief Propagation 


straints, but they may not together correspond to the marginal distributions of any higher¬ 
dimensional joint distribution. By enforcing the marginalization constraints for {qi}, {qij}, 
we find that the messages follow the fixed-point conditions 


M ij ( x j) x / ex P {{OijAij)/<Xij + {QiAi)} 


rirG7V(i)\i Mri ( x i 


Xi 


M, 


J* 


, -dxj 

Xi) 1 - 0 * 


= I exp Wij, 4>ij)/uij)} | } dx i' 


(3.24) 


To hnd the fixed point corresponding to the pseudomarginal densities we run the algorithm 


in figure 3.3 to convergence: 

The beliefs are normalized to integrate to 1, so they correspond to a proper density. 
At convergence, the beliefs correspond to the univariate pseudodensities {<?*}• The 
message updates can be performed in parallel. Furthermore, more elaborate schedules exist 
which may speed up convergence. For example, updates can be formed dynamically. Also, 
message updates corresponding to beliefs which have reached convergence can be skipped. 
See [Gonzalez et ah, 20ll for a treatment on different parallel scheduling methods. If the 
fixed point updates do converge, they will converge to the unique fixed point corresponding 


to the global minimum of (3.10). The fixed-point updates are not guaranteed to converge. 
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In our experiments we only encountered stability issues after taking too large of a step in 
the ISTA algorithm for estimation, resulting in an unstable candidate step; if we encounter 
a convergence problem we simply take a smaller step. As such we don’t find the need for 
damping or other techniques to encourage convergence. 

To perform message passing in practice we discretized messages and approximated inte¬ 
grals using a Riemann sum approximation. We found this to give very accurate results in 
experiments. Other approximations for continuous message passing exist |Noorshams and 


Wainwright 2013 Sudderth et al. 2010 which could be more memory and computation 


efficient for large problems. 


3.4 Optimizing Edge Weights 

The previous analysis outlines how to compute Q(9, a) for a set of fixed edge weights 
In this section we show how the edge weights can be optimized to produce tighter bounds 
on the likelihood by solving 

Q(8) := min Q{6,a). (3.27) 

a&Sj- 


Our analysis follows that of |Wainwright et alT 2005 . From Danskin’s theorem Bertsekas 


1999 , observing the form of (3.10) it follows that the function Q{9 , a) is convex as a function 
of a, with gradient 

V aij Q(6,a) = -I ij (T*(6)), (3.28) 


where t*(6) are the pseudomoments from solving the dual problem (3.12). (3.27) is 


the minimization of a convex objective over a convex polytope, so it is a convex problem. 
However, the number of constraints characterizing Sj- are typically prohibitively large. To 
avoid dealing with them directly, we employ the following strategy. Suppose we have the 
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1. Input edge weights {w^j }; 

2. Initialize edge set T (J = 0; 

3. For k — 1,. .., d — 1 : 

Find largest w^*j* such that U ( i*,j *) doesn’t form a cycle; 

Set T k «- T fc_1 U (i*,j*). 

4. Output edge set T d ~ l . 

Figure 3.4: Kruskal’s Algorithm 


current iterate V. We linearize Q(0,a) about V and solve 

min s V a Q($, a) T (s — a*), (3.29) 

s.t. s G (S 7 -. 


This is a linear program, so the solution must always fall on at least one vertex of 
S']-. From observing the structure of Sj- as being supported by spanning tree indicator 
vectors, a solution s is equal to the indicator vector of a maximum weight spanning tree with 
weights {Iij{r* (60))}. Finding the maximum weight spanning tree can be done efficiently 


in 0( .E log E ) time using Kruskal’s algorithm (Figure 3.4). Lastly we update the edge 
weights ca * + (1 — c)s, where c is a step size c G (0,1). To ensure convergence 

guarantees, c can either be set to c = or it can be chosen using line search. This 


technique is known as the Frank- Wolfe algorithm Bertsekas, 1999 , and is known to converge 
at the rate of 0{l/t). 


3.5 Variational Maximum Likelihood 


Variational regularized maximum likelihood replaces the regularized maximum likelihood 


equation (2.3) with 


max j(0, fi) - Q(9) — \TZ{9 e ) }>, 


(3.30) 
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Q ( 6 ) is the variational approximation to the true log-partion function Z (6). 

This shares many features in common with the regularized MLE. For example, by Dan- 
skin’s theorem VQ(0) = r*(0), the optimal pseudomoments from message passing. Further¬ 
more, V 2 ^Q(0) = coVq*((J) a , (fib), and the minimality of (j) implies Q is strictly convex, since 
for any a ^ 0, a T X7 2 Q(0)a = var 9 *(a T </>) ^ 0 v — a.e.. 


3.5.1 Optimization Algorithms 


Both the exact optimization problem in (2.3) and the approximate problem (3.30) can be 
written as minimization of a smooth (strictly convex) function plus a non-smooth convex 
function, 


min | — C(6) + XJZ{6) 1. (3.31) 


Several algorithms have been designed to solve problems of this form; for a review see Bach 


et al. 

2011 . We will focus on what are known as proximal gradient methods Nesterov 


2013 

Beck and Teboulle, 

2009 , a class of first-order methods which have proven effective 


for large scale, non-smooth optimization. 


ISTA 


The simplest such algorithm is called the iterative-shrinkage thresholding algorithm , or ISTA, 
which works as follows. Fix the current estimate 6 t . Linearize C about the current point 
and solve: 

= argminl- 9) + A K(0) + §||9 - 0*|| 2 }, (3.32) 

OeV l 2 J 

where L is a step size. The squared norm term is called the proximal term, which encourages 

the solution not to be too far from the current step A. After some manipulation, is can be 

made equivalent to 


PlO 1 ) = argminj \ 
OeV l 2 


9- ( 6* - -(-V£(0*)) 




(3.33) 
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1. Input current iterate A; 

2. Fix a Lq > 0, 5 > 1; 

3. Find the smallest nonnegative integer l such that for L r = 5^Lq, 

C(B*) - a,p L ,(e‘)) < <p L , (9‘) - 9‘, V£(0*)> + yllo* - 


i\ 112 


4. Update A +1 with (13.331), using L = <FLq 


(3.36) 


Figure 3.5: ISTA Line Search 

For the group regularizer 1Z{6) = Yh(i j)eE Ill'll > fh e solution has a closed form and is 
given by 


(pl(o% = e\-~ 


{PL{^))ij = 1 


V i £(0 t ) 

X/L 


(3.34) 


+ 


, i + , i , (3.35) 


and we set the update steps to 6j = (pL(0 f ))i an d ®\j — {PL{^))ij f° r each i e V and 
G E. (■)+ := max(-,0). In the absence of regularization, the proximal gradient 
method simplifies to gradient descent. When A > 0, due to the soft thresholding, it may 
produce exactly sparse solutions, where all edge parameters 9jj for a particular edge (i,j) 
are zero. This is an advantage over other methods which only produce a sparse estimate up 
to numerical error, and necessitate truncation. 

Since exact bounds on the Hessian of C aren’t known, we must choose L using line search. 
We employ the backtracking line search from |Beck and Teboulle, 2009 , in Figure 3.5 


FISTA 


The accelerated counterpart to ISTA is the fast iterative thresholding-scaling algorithm , or 


FISTA Beck and Teboulle, 2009 . It is analogous to the accelerated gradient method in 
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1. 

Input iterates 9 t ,9 t ~ l ,y t \ 



2. 

Fix a Lq > 0, 5 > 1; 



3. 

Find the smallest nonnegative integer l such that for L' = 

5 l L 0 , 



Ciy 1 ) - C{ Pv (y t )) < (pL^y 1 ) -y\ V£(/)) + y|| y l 

-pviy 1 ) II 2 , 

(3.37) 


l+\/l+4a? 



4. 

Set a t+ 1 - —^-, 



5. 

Update 9 f ‘ = p^y*), using L = 5 1 Lq, 



6. 

Update y t+1 — 9 t + — 9 t ~ 1 ). 




Figure 3.6: FISTA Line Search 


smooth optimization, which has shown to be an optimal first-order method for smooth 
optimization |Ncmirovsky and Yudin, 1983 . Instead of the new parameter iterates being 
a projection of the previous, it is a projection of a linear combination of the two previous 


iterates. The updates with line search are given in 3.6 


FISTA requires essentially the same computation at each iteration, in particular the same 
number of gradient evaluations. In addition to the standard proximal gradient algorithms, 
we found success initializing Lq using a secant rule: 


L 0 = 


( 9 * - 0*" 1 , -V£(0*) + w;^- 1 )) 

|| 0 t - 0 t —!||2 


(3.38) 


Typically we find initializing with the secant rule finds a direction of sufficient descent with 
little backtracking. 


Discussion 


ISTA and its accelerated counterpart FISTA Beck and Teboullc, 2009 have linear conver¬ 


gence, that is at the rate O(C^) for some C < 1, when — C is strongly convex. In contrast, if 
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—C is only Lipschitz, ISTA converges at the sub-linear rate 0(l/k), and FISTA at the rate 
0(1/C). Recall that with strong convexity, gradient descent also has linear convergence, 
so proximal gradient descent behaves like gradient descent despite the objective not being 


smooth. Thus proximal gradient methods have superior theoretical guarantees to competi¬ 
tors such as subgradient descent. The negative log-likelihood — C is only strictly convex, 


but it is strongly convex in a neighborhood of the solution |Kakade et ah, 20101. Thus we 


may think of these proximal gradient methods converging linearly after a sufficient ” burn-in” 
phase. 

FISTA, like the accelerated gradient algorithm has been shown to outperform ISTA in 


some real problems |Beck and Teboulle, 2009j. However, it does have some disadvantages. 


It is not guaranteed to decrease the objective after each iterate. Further, ISTA may con¬ 


verge rapidly when well-initialized. In our application, this will commonly happen, because 


parameters are estimated over a range of A, each solution used as a warm start for the next. 


3.5.2 Choosing Tuning Parameters 


As of yet we have not discussed how to practically choose the truncation parameters mi, m 2 
and the regularization parameter A. We suppose the existence of a held-out tuning set; in 
the absence, one may use cross-validation. We choose mi, m 2 , A to minimize the negative 
log-likelihood risk in the held out set. To save on computation, we use the idea of warm 
starts which we detail in the sequel. First, observe that the first-order necessary conditions 
for the regularized MLE are: 



(i,j) e E. 


i e V; 


(3.39) 


(3.40) 
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where Z denotes the sub gradient of the regularizer 7 Z, at 9 , which is 


Zij — 


{x:||x||<l}, ||0*j||=O, 


(3.41) 


lift 






o/w. 


0. t j = 0 when 


^ > ||Hij II • 


(3.42) 


When Oij = 0 for each (i,j) G E, it’s clear that /i.y = = 'jl-fij by independence and 

the moment-matching condition (3.40). This allows us to choose an upper bound A max such 


that the solution will have no nonzero edge parameters: 


A start > max Wfej - fiiiLjW. (3.43) 

{hj)eE 

The idea behind warm starting is the following: we begin by estimating 9\ gtarV which 
amounts to d univariate density estimation problems which can be performed in parallel. 
Then we fit our model on a path of A decreasing from A start, initializing each new problem 
with the previous solution 9\. The solution path for the regularized MLE is smooth as a 
function of A, suggesting nearby choices of A will provide values of 6 which are close to one 
another. 

We can also incorporate warm-starting in choosing mi, m 2 - For a given A, we first es¬ 
timate the model for first-order polynomials, corresponding to m\ = m 2 = 1. We then 
increment the truncation parameters by increasing the degree of the polynomial of he suffi¬ 
cient statistics. We augment the previous parameter estimate vector with zeros in the place 
of the added parameters, and warm start ISTA from this vector. 
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3.6 Experiments 


Our simulations were conducted on a workstation operating 23 Intel(R) Xeon(R) CPU E5- 
2420 1.90GHz processors using R with backend computations written in C + + and compiled 
using the Repp package |Eddelbuettel and Frangois 2011 . Calculations, including message 
passing were parallelized using the Threading Building Blocks C + + library. We discretized 
messages uniformly over [0,1] on a grid of 128 points, and bivariate pseudodensities approx¬ 
imated on a 128 x 128 grid. All integrals were approximated using Riemann sums over these 
discretizations. We fit our model using the ISTA algorithm, stopping after an objective 
improvement of less than 10 -4 or 1000 iterations. We selected the tuning parameters as de¬ 


scribed in Section [3.5.2| using warm starts. To choose the edge weights, we generate a series of 
random spanning trees and take the average edge appearance probability as the edge weight. 
This produces a valid vector in the spanning tree polytope. We conducted experiments opti¬ 


mizing edge weights by the algorithm in Section |3.4[ but found the risk improvement in our 
simulations to be small relative to the additional computational requirement. We generated 
three types of data. First we generate independent, identically distributed random Gaussian 
vectors each with mean {0.5,... ,0.5} and covariance E. E is scaled to have diagonal 1/8 2 
and sparse off-diagonals. The sparsity pattern was generated by randomly including edges 
with probability 2/d, so the expected number of edges is d — 1. Furthermore, we gener¬ 
ate non-Gaussian data by generating Gaussian data by the aformentioned procedure and 
marginally applying the transformation y = signflc — 0.5) | x — 0.5 /5 + 0.5. That is, the 

transformed data is distributed as a Gaussian copula. These two models follow a pairwise 
factorization. Finally, we generated data as a mixture of three non-Gaussian distributions, 
each with edge structure of a randomly generated spanning tree. The distributions are given 
equal mixing weights. The resulting mixture of trees is not a pairwise distribution. 
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3.6.1 Risk Paths 


We first examine the risk paths, varying A and hence the number of included edges for typical 
runs of our simulation. We compare the TRW and it’s relaxed version (refitting the model 
under the selected sparsity constraint and setting A = 0) to the graphical lasso (using the 
glasso R package) and its relaxed counterpart. Examples of estimated pseudodensities are 
shown in |3.7 For Gaussian data, we see that TRW performs slightly worse than glasso for 
sparse graphs, while the gap widens as the graph becomes denser. There are two clear reasons 
for this. Since the Gaussian model is correct, glasso automatically provides a correctly 
specified model. TRW must select the number of basis elements and so may include too 
many (or few) parameters. Furthermore, the variational bound from TRW worsens as the 
graph becomes denser. For the non-Gaussian data, TRW clearly outperforms the glasso. 
The risk paths between TRW and glasso look markedly different. TRW demands many 
more parameters to get the optimal fit, so the performance from the relaxed TRW suffers, 
while the regularized version benefits by reducing overfitting, thus the relaxed and regular 
versions have very similar risk. As is well-known, model selection using held-out risk is 
precarious as the risk path is often flat. In our simulations TRW did quite well. Both glasso 
and TRW included two false edges for the Gaussian simulation, while TRW omitted several 
edges. For the non-Gaussian data, glasso included several more false egdes than did TRW. 


3.6.2 Density Estimation 


We continue by comparing held-out risk estimates to several other high-dimensional density 


estimation algorithms in Table 3.1 glasso, spherical Gaussian mixture model (with num¬ 


ber of components chosen by BIC, |Fraley and Raftery, 2002|), and the Kernel maximum 


spanning tree estimator of Liu et al. 2011 . We consider sparse Gaussian, sparse Gaussian 


copula and tree mixture data for dimensions between 30 to 120. For all simulations we set 

n — 100 and hold out 300 observations for testing. We repeat the simulations (but keeping 

the generating distribution fixed) 5 times and report the standard errors in parentheses. For 
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True TRW Glasso 





s • s ° s >< 


Figure 3.8: Risk path on simulated Gaussian data. Top row is negative log-likelihood risk 
on held-ont data; bottom row are selected graphs. Red edges are false inclusions; gray edges 
are false omissions. 
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Gaussian data, TRW performs competitively. It dominates the forest and spherical mixture 
density estimators but is slightly worse than glasso. For the Gaussian copula data TRW 
outperforms other methods; TRW can capture both non-Gaussianity and the cyclical depen¬ 
dence structure, while glasso can only capture the latter, and the forest density estimator 
the former. For mixtures of trees the results are similar to the copula simulation, with TRW 
dominating the other methods, despite the data not following a pairwise factorization. 


45 



!h to oo to co 

S H CO C5 

■R d C C C 

• O —- o o o 

S 

b- oo b- 
Ch b~ <D t-h O’ 

Q ^ ^ ^ 
‘co h CM lO 00 

1 CO CO b- 


CD CO N (N 
H CD LO 00 

^ T-1 LO T-H 

n —' o o o 

O CN CD CO 
• O CD 
H 00 CO CO 

H CO ^ 


00 (M ^ ^ 
OO CN ^ CO 
O CN CN iO 

o o o o 

CD (M CO 
M N ^ lO 
O CO CO H 


C/0 OO r-H r-H 

R o o 

q (M CO N CO 

fa O O O fa 


to \f Ol CO 

oc cm _; b- 


^ O to N 
CNJ ^ ON ^ 

O CO O CO 


oo co m oo 

ON GO LO CO 
i—I CN CO b- 

d co io ^ 


N lO OO 
IN O* to O 

o C 0 C 

' o o o 

05 

00 lO CO IN 

lo '—i i—i oo 
^ ^ co co 
1 cd co fa 


qzi ^ 00 ^ 05 

D LO CO ^ 
po CNJ CNj CO OD 

. o o o o 

Cd co io h lo 

^ OO N CM CO 

H N CD N ^ 

o ^ b^ oo 

CM CO D I 


CD CO CD 00 
CO LO LO 00 
CM CM ^ CO 

o o o o 


lO N H CO 
00 H O) 00 


O) CM 00 O) 
CM ^ (N ID 
q CM CM ^ 

s —•" O O O 
CD 

D N ID ^ 
OO N O CO 
^ rc- oo io 

• ^ cd 


S N ^ CO 00 

pd O lO CD CD 

□ CN CN 

^ o o o o 


io O) CO ^ 
b- I s - CD CO 
CM CM CO ^ 

O O O O 


CD 00 00 


O O O O 


^ CD CO H 
LO t —i CM <D 

O ^ 1^ O 
CM CO D 00 


CO Oi CD H 
T \t H LO LO H 
CO LO <D t—I 

CO LO O CN 


io N H lO 
t-h t-h CD CD 
CM CM ^ CO 
CO L0 6 CM 


cd 

00 

CM 

CD 

CD 

CO 

LO 

CO 

LO 

CM 

CD 

t-H 

CD 

CD 

i—i 

CD 

t-H 

i—I 

b- 

CM 

LO 



O 

LO 

CD 

Pd 

CO 

CM 

iO 

CD 

CO 


CM 


T-1 

CO 

CM 



cd 

cd 

o 

O 

o 

cd 

cd 

o 

cd 

O 

O 

O 

O 

co 

^ 

^—-' 

x 

v 

v 

s -—^ 

^—' 

^—' 

s -—' 

^ x 

x ^ 

x ^ 

CO 

CN 

CD 

00 

i-fa 

00 

CO 


LO 

CD 

CD 

iO 

LO 

ccj 

CM 

CO 

CD 


o 

CD 

LO 

r^ 

CM 

00 

b- 

iO 

o 

j— fa 

t-H 

CD 


LO 

t-H 

o 

CN 


GO 

CD 

GO 

CM 

LO 


i-fa 

cd 

cd 



O 

O 

oi 

CM 


1 

CO 

CD 

00 

1 

1 








c/o lO lo <D h 

^ N 00 h CO 

r_0 T -1 T-H ^ LO 

O o o - o 

CO 05 CD 

O J 05 N 

o S ^ LO 


H ^ io CO 
CD O CD ^ 
H CM CM CM 

o o o o 

CO H H 00 
LO ,—C CO <D 

^ V ^ co 


00 ^ CD H 

O CD b- ^ 
t-H t—H t—I LO 

o o o o 


0 N LO N 
O CD CO 
^ ^ ^ 00 


^ o o o o 

CO io O CM 


o o o o 

CO IO O CM 


o o o o 

CO io O CM 


46 



3.6.3 ROC Curves 


Figures 3.10 and 3.11| display ROC curves for two types of experiments. We generate non- 
Gaussian data as before, with n — d — 30. The curves trace the true negative and true 
positive percentages of the algorithms, varying the regularization parameter A. We choose 
mi, m 2 so that the resulting curve has the largest value of max^ T P(\)+TN (X). The plotted 


curves are an average of 20 repetitions. In Figure 3.10 the edges are chosen to be a randomly- 


generated spanning tree; in Figure 3.11 the edges are included with equal probability 2/d, 
which we call the ER graph. We compare the TRW estimator to the SKEPTIC estimator 


from Liu et ah, 2012a . The SKEPTIC estimator was particularly devised for estimating 
Gaussian copula graphical models, while our estimator is designed for a superset of those 
models. 

Overall the SKEPTIC performs better in terms of area under the curve (AUC). It also 
generally has a better TP% for moderate or small values of TN%. However the TRW 
estimator still performs quite well in these two metrics. For large TN% the TRW estimator 
manages a better TP%. We can only speculate on why the TRW estimate performs better 
here, but it may be because when the selected graph is very sparse it has few or no cycles, 
and the tree-based approximation of TRW is more powerful. This is consistent with the 
experiments, as the phenomenon is more pronounced for the tree simulation than the loopy 
graph simulation. 


3.6.4 MEG Data 


Magnetoencephalography or MEG is a neuroimaging technique for mapping brain activity 
using electrical currents in the brain. The resulting signals are high-frequency and have 
a complex non-linear relation to one another. There has been interest in using various 


neuroimaging techniques for mapping regional brain networks Kramer et al. 2011 , and 


particularly in understanding differences in connectivity related to neurodegenerative dis¬ 
eases 


Stam 2010 . We explore this on the MEG data from Vigario et al. 1998 , which 
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Figure 3.9: Performance on simulated Gaussian copula data. Top row is negative log- 
likelihood risk on held-out data; bottom row are selected graphs. Red edges are false inclu¬ 
sions; gray edges are false omissions. 
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Figure 3.10: ROC Curve, tree graph, u=d=30. Solid line: SKEPTIC estimator. Dotted 
line: TRW. 



Figure 3.11: ROC Curve, ER graph, n=d=30. Solid line: SKEPTIC estimator. Dotted line: 
TRW. 
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contains measurements from 122 sensors. We scale the data to be contained in the unit cube 


and remove large outliers (marginally larger than 6 standard deviations). Two features of 
this data are the temporal dependence of the signals and the presence of artifacts. Since our 
main motivation is graph estimation we will not address these issues, besides restricting our 
attention to a small timespan of the dataset. We use the first 400 observations in the series, 
randomly assigning 100 as training data, and the other 300 as test data. 

Inspecting two-dimensional projections of the data and their corresponding pseudoden¬ 


sities in Figure 3.12 the data displays clearly non-linear and multi-modal behavior which 
TRW can capture, but glasso cannot. We compare the glasso and TRW methods for graph 
estimation. We select a graph to minimize the hcld-out risk for the relaxed TRW, which has 
199 edges. We then estimate the graph using the graphical lasso, setting the regularizaition 
parameter to include the same number of edges. The estimated graphs are shown in figure 


3.13 For clarity, we color code vertices from the top four clusters produced from running 


the learned graphs through the community detection algorithm of Newman, 2006 . Note 
that the position of vertices here does not correspond to the actual location of the sensors on 
the scalp. Comparing the graphs from the minimum risk estimators, the two graphs share 
many features in common, each having one large connected component containing several 
smaller densely connected communities. However, they have clear differences, disagreeing on 
66 edges, or one-third of the edges in each respective graph. We believe due to the complex 
nature of the observed signals in imaging data such as MEG, our method may be able to 
bring new and better insights to understanding functional connectivity of brain networks. 


3.7 Discussion 

In this section we detail a tree-reweighted variational approximation for continuous-valued 
exponential families, which we apply to the exponential series estimator of chapter 1. Evalu¬ 
ating the variational likelihood involves message passing which can be effectively parallelized 
for high-dimensional problems, and provides a lower-bound to the likelihood (and upper 
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Figure 3.13: Estimated graphs from MEG data. Top: Graph estimated from TRW ; Bottom: 
Graph learned from glasso 
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bound to risk). We describe a proximal gradient algorithm for estimating the regularized 
MLE. Our experiments show this approach has very attractive performance in both risk 
and model selection performance, compared to other methods in the literature. We also 
demonstrate our method on a data set of MEG signals. 
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CHAPTER 4 


REGULARIZED SCORE MATCHING 


4.1 Introduction 


Undirected graphical models are an invaluable class of statistical models. They have been 
used successfully in fields as diverse as biology, natural language processing, statistical 
physics and spatial statistics. The key advantage of undirected graphical models is that 
its joint density may be factored according to the cliques of a graph corresponding to the 
conditional dependencies of the underlying variables. The go-to approach for statistical es¬ 
timation is the method of maximum likelihood (MLE). Unfortunately, with few exceptions, 
MLE is intractable for high-dimensional graphical models, as it requires computation of the 
normalizing constant of the joint density, which is a d-fold convolution. Even exponential 
family graphical models |Wainwright and Jordan, 2008], which are the most popular class 
of parametric models, are generally non-normalizable, with a notable exception being the 
Gaussian graphical model. Thus the MLE must be approximated. State of-the-art methods 
for graphical structure learning avoid this problem by performing neighborhood selection 


Yang et al. 2012 Meinshausen and Biihlmann 2006 Ravikumar et ah, 2010 . However, 


this approach only works for special types of pairwise graphical models whose conditional 
distributions form GLMs. Furthermore, these procedures do not by themselves produce 
parameter estimates. 

In this chapter we demonstrate a powerful new method for graph structure learning 
and parameter estimation based on minimizing the regularized Hyvarinen score of the data. 
It works for any continuous pairwise exponential family, as long as it follows some weak 
smoothness and tail conditions. Our method allows for multiple parameters per vertex/edge. 
We prove high-dimensional model selection and parameter consistency results, which adapt 
to the underlying sparsity of the natural parameters. As a special case, we derive a new 

method for estimating sparse precision matrices with very competitive estimation and graph 
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learning performance. We also consider how our method can be used to do model selection 
for the general nonparametric pairwise model by choosing the sufficient statistics to be basis 
elements with degree growing with the sample size. We show our method can be expressed 
as a second-order cone program, and which we provide highly scalable algorithms based on 
ADMM and coordinate-wise descent. 

4.2 Background 

4-2.1 Graphical Models 

Suppose X = (Xi,... ,Xd) is a random vector with each entry having support X,j, i = 
1,..., d. Let G = (V, E) be an undirected graph on d vertices corresponding to the elements 
of X. An undirected graphical model or Markov random field is the set of distributions 
which satisfy the Markov property or condition independence with respect to G. From the 
Hammersley-Clifford theorem, if X is Markov with respect to G, the density of X, p can be 
decomposed as 


p(x) oc exp ^ E tpcGc) 
. cecl(G) 


(4.1) 


where cl(G) is the collection of cliques of G. The pairwise graphical model supposes the 
density can be further factored according to the edges of G , 


p(x) oc exp ^ ^ if) i j(x il xj) 

i,jeV,i<j 


(4.2) 
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For a pairwise exponential family, we parametrize i^i , 'ip.jj by 


: = 6 i€( X i^ 

u<m 

i G V, 

(4.3) 

ij{xi,Xj ) := Ofjtij&uXj), 

(■ i,j) e E. 

(4.4) 


u<m 


Here m denotes the maximum number of statistics per edge or vertex. We denote 9 to be 
the vectorization of the parameters, 6 := (0^,..., , 9 ^ 2 , ..., dJ^) T . 


4-2.2 Scoring Rules 


A scoring rule Dawid and Lauritzen, 2005] S(x, Q ) is a function which measures the predic¬ 
tive accuracy of a distribution Q on an observation x. A scoring rule is proper if E p [S'(W, Q)] 
is uniquely minimized at Q — P. When Q has a density q, we equivalently denote the scoring 
rule S(X, q ). A local scoring rule only depends on q through its evaluation at the observation 
x. A proper scoring rule induces an entropy 


#(p) = Ep[S(X,p)], 


(4.5) 


as well as a divergence 


D(p,q) =Ep[S(X,q)- S(X,p)]. (4.6) 

An optimal score estimator is an estimator which minimizes the empirical score 

1 n 

-TS(X T ,,), (4.7) 

1 1 
r=1 

over some class of densities. 

Example 4.2.1. The log score takes the form l(x,q) := —log q(x). The corresponding 
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entropy is the Shannon entropy H(p) := —Ep[logp], its corresponding divergence is the 


log 


P(X) 


Q(X) 


and the optimal score estimator 


Kullback-Leibler Divergence KL(p | q) = 
is the maximum likelihood estimator. It is a proper and local scoring rule. This scoring rule 
was implemented in Chapter 2. 

Example 4.2.2. Consider the Bregman score , 


b{x,q) := -g'(q(z)) - / p{dy) (g(q(v)) - q(y)g'(q(y))), 


(4.8) 


where g : —>• M. is a convex, differentiable function and p is some baseline measure. The 

corresponding entropy is H(p) = — f p(dy)g(p) and divergence 


D(p,q ) = / p(dy) ( g(p ) - (g(q) + g'(q)(p - q))) . 


(4.9) 


When g(x) = log(x), after removing the constant term, b has the form 


b(x,q ) = 


q(x) 


p(dy) log (g(y)) 


= -e fW - / p(dy)f(y), 


(4.10) 

(4.11) 


where / = logg. This is a proper scoring rule |Dawid and Musio, 2014 , but it is not local 
because it depends on values of q besides the observation x. An estimation procedure for 
nonparametric graphical models using smoothing splines was based on this scoring rule in 
Jeon and Lin, 20061. 


4-2.3 Hyvarinen Score 

Consider densities q which are twice continuously differentiable over X = IR^ and satisfy 


||p(x)V logg(x)|| —» 0, for all ||x|| —» oo. 


(4.12) 
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where X ~ p. Consider the scoring rule 


h{x,q) = -||Vlog 9 (i)|| + Alogij(i), 


(4.13) 


where V denotes the gradient operator and A is the operator 


A0(x) = 


d 2 cj)( 


x) 


ieV 


dxj 


(4.14) 


This is a proper and local scoring rule Parry et ah, 2012 . Using integration by parts, it can 


be shown it induces the Fisher divergence: 


F(p | q) = E Xr 


V log 


P( X) 


q(x) 


(4.15) 


The optimal score estimator is called the score matching estimator Hyvarinen 2005 


2007 . The Hyvarinen score is homogeneous in q Parry et al. 2012 , so that it does not 


depend on the normalizing constant of q, which for multivariate exponential families is 
typically intractable. Second, for natural exponential families the objective of the optimal 
score estimator is quadratic, so the estimating equations corresponding to score matching 
are linear in the natural parameters |Forbes and Lauritzen, 2014], Maximum likelihood for 
exponential families generally involves a complex mapping from the sufficient statistics of the 
data to the natural parameters |Wainwright and Jordan 2008 Brown, 1986| , necessitating 
specialized solvers. 


4-2.4 Score Matching for Exponential Families 

For a pairwise density, define f.g = , ■ ■ ■ , ■ For Uh, denote Gq(x) := -ggrf-g and 


(A-W), := 


(4.16) 
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Taking derivatives, 




(4.17) 


thus h takes the form 


h(x,9) = ^ (\ dT ,t a i( X ) a i( X ) Td -,i + K ;i( X ) T 0;i) ■ 
i£V ' ' 


(4.18) 


h is a sum of d positive-semidehnite quadratic forms, so it is also psd quadratic. Alternatively, 
we may write h(x,9 ) = 6^ A(x)8 + K(x)^~9, where A(x) is a psd matrix with at most 2 md 

non-zero entries per row, and K(x) is a vector with K % j = + l{i j } V . If we write 

T 1 J 

8 = \ 8j\i8~ 2 , • • •, J , where 6{j = 8ji, we may write the scoring rule as 


1 IT 


h(x,8) = ^8' A(x)8 + K(x) t 8, 


(4.19) 


where A{x) is a block-diagonal matrix, 


/ 


A{x) = 


a\{x)a\(x) 


T 


CL2( x )a2( x 


iT 


\ 


V 


a d (x)a d (x) T ) 


(4.20) 


and K = (Kj ^,..., Kj^j - We will alternate between these two equivalent representations 
of h based on convenience. 

Remark 4.2.1 (Bounded supports). From the differentiability assumption we see that our 
derivations do not generally apply when is a half-bounded or bounded support as the 


density may not be differentiable at the boundary. However, in Hyvarinen 2007 a proper 


scoring rule was derived for half-bounded supports, which may be shown to have the same 
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form as (4.19), after modifying slightly the formulas for dj(x), K(x). Here we derive a similar 
formula for densities on [ 0 , 1 ]^. 


Proposition 4.2.2. Consider random vectors taking values in [0,1]^, with density q; suppose 
X ~ p. If q is twice continuously differentiable and satisfies 


||p(x)V logg(x) ( 8 ) x(l — x)|| —> 0, for all x approaching the boundary , (4-21) 


where ® denotes the tensor product x ® y \= {x\y \,..., xyyf), then 


h(x, q) := -|| Vlogg(x) <8> x{l — re) || 2 


( - 2 ( 2 xf - 1)^(1 - 
i£V \ 


dlogq(x) c ) 2 log q(x) . 

X P ^ J r x i\ 1 x i) oi 9 I 1 (4.22) 


dxi 


dx 2 


is a proper scoring rule. In particular when q is an exponential family with natural parameters 
9 and sufficient statistics (f>, h(x,9) = ^9~^A(x)9 + K(x)~^9 is a proper scoring rule, where 


d(i),; 

K(x)ij = -2(2 xi - l)xi(l - x i)~Qfrr + fot 1 _ x *)) 


,2 d2( t>ij 

dx 2 


a i{ x ) = x i( 1 - x i) 


dx; 


(4.23) 

(4.24) 


and A(x ) = diag(ai(x)ai(x ) T ). 

Thus, all of the results in this work may be effortlessly carried over to exponential families 
over bounded supports. 


4.3 Previous Work 


There is a small but growing literature on applications using the Hyvarinen score for estima¬ 


tion. |Sriperumbudur et ah, 2013 consider using the Hyvarinen score for density estimation 


in a reproducing kernel Hilbert space (RKHS). They consider the optimization for a density 
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1A 


mm < — 
q n 


k=1 


h(X\q) + - 


"2 


n (> 


(4.25) 


where || • |||^ is the norm of the RKHS. After an application of the representer theorem 


Kimcl- 


dorf and Wahba, 1971 , they show this may be expressed as a finite-dimensional quadratic 


program. They derive rates for convergence to the true density with respect to the Fisher 
divergence. 


Vincent, 2011 shows that the denoising autoencoder may be expressed as a type of 


score matching estimator, which they call denoising score matching. Suppose that A" is a 
version of a sample A" which has been corrupted by Gaussian noise, so that its conditional 
distribution has the score <9log q(x \ x ) = \{x — x). Suppose we seek to fit the corrupted 
data according to a density of the form 


logp(x | W, b, c) oc- J (c, x) — — \\x \\2 + softplus J x) +bj 


(4.26) 


where softplus(a;) = max(0,x), then minimizing the Fisher divergence between the model 
density and q(x \ x) can be shown to be equivalent to minimizing 


E 


q(x,x) 


11 IF T sigmoid (W X + b) + c-X\ 


(4.27) 


This is a simple denoising autoencoder with a single hidden layer, encoder f(x) = sigmoid(bFx+ 
b ), and decoder f'(y) = W T y + c. 


Score matching has also been used for learning natural image statistics Kingma and 


LeCun, 2010 Koster et al. 2009 . 
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4.4 Score Matching Estimator 


Define the statistics 




r =1 
n 


K =- E A '(- Yr )' 

n 


r=l 


The regularized score matching estimator is a solution to the problem 


(4.28) 

(4.29) 


6 e argmin <j -9 J T9 + K'6 + 11(9) 
0 I 2 




(4.30) 


Here 1Z is the group penalty 


n») = E n»iiii2- 

ijeV 


(4.31) 


This norm induces sparsity in groups (i.e. edges/vertices). In high dimensions, regularizing 


the vertex parameters is necessary, as (4.30) need not exist otherwise. Both the scoring rule 


and regularizer of (4.30) are convex in 0, so it is a convex program. In particular, observe 


that it can be equivalently represented as 


s.t. 


mm 

t,tij 


t + A t 


ij 


(4.32) 


13 


t > -6 J T6 + K 1 9, 
~ 2 

tij ^ ll^ztll- 


>T, 


(4.32) is a second-order cone program (SOCP) |Boyd and Vandenberghe 2004, as the 
quadratic constraint can be re-written as a conic constraint. If T is not positive definite, 


particularly when n > d, (4.30) may not be unique. This is typical for high dimensional 
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problems. One can impose further assumptions to guarantee uniqueness. For example vari¬ 
ous assumptions have been described for the lasso (see an overview of these assumptions in 
Tibshirani et al. 2013]) , but we won’t go into those details here. 


4-4-1 Gaussian Score Matching 


Consider the Gaussian density: 


q{x) oc exp <j — -x^Vtx , 


(4.33) 


for >~ 0. We have 


Vlogg(rr) = —h2x, 
Vj(Vj log q(x)) = - flu. 


(4.34) 

(4.35) 


so the Hyvarinen score is given by 


h(x, Q) — — Gjj H— 


= trace ( —D -I— 

1 2 


(4.36) 

(4.37) 


Let £ = 1 Y2r=1 W r (A" r ) T . The optimal regularized score estimator Q is the solution to 


min trace ^ -QEQ — D ) + A||fl||i }>. 


(4.38) 


In the notation of (4.30), we have 6 = vec(G), K = vec(J^) and Tj = £ for each i 6 V. 

We do not impose a positive definite constraint on fh Doing so would still result in a convex 

program, indeed it is a semidehnite program, but the resulting computation becomes more 

complicated and less scalable in practice. However, our theoretical results imply that D 

is positive de fini te with high probability. Indeed, denote ||D — D*|| S p the spectral norm 
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(maximum absolute value of eigenvalues) of the difference hi — hi*. Since the spectral norm 
is dominated by the Frobenius norm (elementwise L 2 norm), the consistency result in the 
sequel implies consistency in spectral norm, and so the eigenvalues of will be positive 
with probability approaching one, assuming the population precision matrix hi* has strictly 
positive eigenvalues. Furthermore, we note that our model selection guarantees still follow 
whether or not the estimator hi is positive definite. 


4.5 Main Results 

We suppose we are given i.i.d. data X 1 ,... ,X n ~ p*. p* need not belong to the pairwise 
exponential family being estimated, in which case we may think of our consistency results 
as being relative to the population quantity 


9* : = |E p *[A(A)]} 1 E p * [K(X)\ (4.39) 

= (r*)- 1 /!*. (4.40) 

Define the maximum column sum of 9* by 

1 : = max ( 4 - 41 ) 

j€V,u<m 


and define the maximum degree as 


s ■= max | {(i,j) : (■ i,j) G E}\. 
ieV 


(4.42) 


Assumption 4.5.1. T* = E ?; * [aj(X)aj(A^) T ] satisfies for each i e V, 


00 > e > A max (r*) > A min (r|) > e > 0. 


(4.43) 


Note that this also implies that the eigenvalues of T* are bounded as the rows of T* are 
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non-trivial linear combinations of those of diag(r*), so the inverse in (4.40) exists and is 


unique. 

We also suppose 9* is sparse, in the following sense: 

Assumption 4.5.2. 9* belongs to the set 

V := V(E) = {9 : ||0y || 2 = 0, for (i,j) G E c } . (4.44) 


For both parameter consistency and model selection we require the following tail condi¬ 
tions: 


Assumption 4.5.3. For each i,j, k G V and u < m and t < v, for some c\, C 2 , v > 0, 


P 


*§■ - (K*) 


*\u 

ij 


>t)< exp < —c\nt 


P 


(r i)% - (r*)? fc 


> t) < exp < —C 2 nt 


(4.45) 

(4.46) 


4-5.1 Parameter Consistency 

We present results in terms of the (vector) norm. Note in particular that this result 
doesn’t require any incoherence condition (though we do require for model selection consis¬ 
tency in the sequel). 

For the parameter consistency results in particular, we require the following sub-Gaussian 
assumption: 


Assumption 4.5.4. For each i G V and r = 1, ... ,n, cq( X r ) is a sub-Gaussian random 
vector. 


Theorem 4.5.5. Suppose the regularization parameter is chosen as 




q log (md) 


n 


(4.47) 
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if the sample size satisfies 


n = Q(md), 


(4.48) 


then any solution to regularized score matching satisfies 


\\e-e *\\ 2 = o p 


(d + \E\)mn^ q log (md) 


n 


(4.49) 


Remark 4.5.6. Consider Gaussian score matching. Here m = 1, so if k\q is bounded we 

3* I 


have \\0 — 9 *\\2 = O p 
in |Rothman et al. 


(d+\E\)log(d) \ 


2008 


This rate is the same as the graphical lasso shown 

amounts to 


n I 

. Furthermore, here T? = E, so our assumption 


4.5.1 


bounds on the eigenvalues of E, which are the same as for sparse precision matrix MLE. The 
assumption that ki q is bounded here says that the sums of the absolute value of rows of f2* 
are bounded, which is not necessary for the regularized MLE. 

Remark 4.5.7. We might reasonably expect Kyy = 0(sm), in which case ||0 — 6* H 2 = 
Op ( ^/ (^+I- E l) m ^ 6 _ i n this setting the regularized MLE will have the rate 


Or 


(d + \E\)m log (md) 


n 


(4.50) 


(see results in Appendix A). 


4-5.2 Model Selection 

For model selection we require several additional conditions. Denote E as the edge set 
learned from 9: 


S:={(mMIM2 = 0}. 


(4.51) 
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Furthermore, define 


«r:=ll(rT 1 Hoo, (4-52) 

« 0 :=||0*||max, (4-53) 

p* := min ||6y| max . (4.54) 

Here H^Hoo = rnaxj )T),- | Apj | is the matrix oo norm and || • || m ax the elementwise max 
norm. We require an incoherence condition-. 

Assumption 4.5.8. 


..max II T *ijA T EE) 1 11 2 < - 7 ==, for some r e (°> !]■ ( 4 -55) 

(i,j)£E c yjd + E 

where ||A ||2 is the matrix operator norm. 

In the following theorem we suppose ftp, Kg, s are are bounded, while p* may change with 
the sample size. 

Theorem 4.5.9. Suppose the regularization parameter \ n is chosen to be 


1 uik 


At 


1 ,< 


i log (dm) 


n 


(4.56) 


then if 


n 


n 

0(ma x{m.Ki q log (dm) 


2 2 
, m s 


log(dm)}), 


1 

p * 


= o 



5 


(4.57) 

(4.58) 


there exists a solution to the regularized score matching estimator 6 with estimated edge set 
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E satisfying 


F(E = E)-> 1. (4.59) 

Remark 4.5.10. Assuming m, s, K\ g are bounded, this implies the dimension may grow nearly 
exponentially with the sample size: 


d = o{e n ), 


(4.60) 


with the probability of model selection consistency still aproaching one. 

Remark 4.5.11 (Gaussian score matching). When m = 1, the sample complexity matches 


that for structure learning of the precision matrix using the log-det divergence, in Ravikumar 


et ah, 201l]. Thus Gaussian score matching in particular benefits from identical model 
selection guarantees as the graphical lasso algorithm. However it should be noted that the 
assumptions are slightly different. In particular the graphical lasso requires an irrepresentable 
condition on £ <S> £, while our method involves an irrepresentable condition for £ ® Ig. 


4.5.3 Model Selection for the Nonparametric Pairwise Model 

In this section we consider model selection for the nonparametric pairwise model. We suppose 
the log of the true density p * belongs to Wf , the Sobolev space of order r. This implies, 
along with the pairwise assumption, that logp* has the infinite expansion 


oo 


00 


log p* OC exp <J Y {0*)Yj<l>kl{ x i, x j 

i,j£V,i<j k,l =1 


££(<nh>kte)b 

i&V k= 1 


(4.61) 
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where here {(ftki^kl} is a basis over [0, l] 2 . For an expansion in W£, we have that the 
coefficients decay at the following rates: 


^rfk 2r < oo, 

k 

y 0-; , j k 2n l 2r ^ < oo, 


for all i £ V, (4.62) 

for all (i,j) £ E, Ty + rj = r. (4.63) 


kl 


For our results we assume {(f) k} is the orthonormal Legendre basis on [0,1], and {4>ki} is 
the tensor product basis (fr^x^Xj) = </>fc(xj) ■ 4>[(xj). This is because the supporting lemmas 
are particular to the Legendre basis, but in practice one is not limited to a particular basis. 


Now, consider forming a density by truncating (4.61) after m\ terms for the univariate 
expansions, and m 2 for bivariate: 


m 2 m 1 

logp$ oc exp ; E E e l'4ife.^) + EE«.‘fcw 
i,j£V,i<j k,l= 1 i£V k= 1 


(4.64) 


Observe that this is a finite-dimensional exponential family. Furthermore, the normalizing 
constant for this family will generally be intractable, requiring a d-fold integral. We choose 


our density estimate to be p-^, where 6 is a solution to the score matching estimator (|4.30|) 


for this family. Furthermore, we let the number of sufficient statistics mi, m 2 grow with the 
sample size n to balance the bias from truncation with the estimation error. We denote E 
to be the support of p *: 


S:={(i,j):Kj||=o}, (4.65) 

Now, decompose the vector 6* into the included terms and truncated terms, 6* = 
((d*) T , (6fjr) T ) T an d corresponding sufficient statistics (f) = ((</>) T , (</>r) T )- Denote (ap)i(x) := 
^t(0t)',L A t ( x ) = dia g(( a T)*( 2: )( a r)i( ::c ) T )) and r r = E pI A t( x )}- Applying the results 
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in Section 14.2.41 we have the linear relation 


k* = -v* T e* T - r*r. 


(4.66) 


In the following theorem we assume := ||r£,|| max is bounded, and k\q = 0(m |), 


in addition to the assumptions for the parametric setting stated in Section 4.5.2[ with the 


exception of Assumption 4.5.3 Since the number of statistics grows to infinity in the non- 


parametric case, we need more accurate accounting of the constant terms in the concentration 
inequality. In lieu of the concentration assumption, we have the following assumption on the 
boundedness of the marginals of p. 

Assumption 4.5.12. For each i,j e V, 


e < pij(xi,xj ) < e, 


(4.67) 


for absolute constants e > 0, e < 00 . 

This assumption is mild for density estimation as it only requires bounds on the bivariate 
marginals rather than the full distribution. This is the same assumption used in Chapter 2 
for the TRW estimator. 

Theorem 4.5.13. Suppose that the truncation parameters and regularization parameter are 
chosen to be 


1 


m 2 x n 2r + 13 


(4.68) 


1 


mi x n 2r + 13 


(4.69) 



2r—l 
jl 2r+13 


log nd 


(4.70) 
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and the dimension d and p* satisfy 


. 2r-l \ 

d = o[ e n2r+13 J 


— = o 

p* 


log nd 

2r+l 


77,2r+13 


then there exists a solution 6 such that the edge set E satisfies 


P (E = E) ->■ 1. 


(4.71) 

(4.72) 


(4.73) 


Remark 4.5.14. If r = 2, and s grows as a constant, we may have 


d = o 



5 


(4.74) 


and still ensure model selection consistency. In Chapter 2 it was shown that the sample 
complexity for model selection in the nonparametric pairwise model the regularized expo¬ 
nential series MLE using Legendre polynomials is d = o ^ e n3//7 ^, though this estimator can’t 
be computed exactly. The optimal choice of regularization and truncation parameters is 
much different for these two methods. This is a consequence of different estimation errors. 
In our supporting lemmas (see Appendix B) we require convergence of the statistic T to 
its expectation. In Appendix B we show that applying Hoeffding’s inequality and a union 
bound, 


K i,f)ll r -r| 


max 


= 0 , 



(4.75) 


For the regularized MLE, we needed convergence of the sufficient statistics p, which 
converges at a much faster rate of O p m ' 2 • Our results agree with intuition, that 

the score matching statistics, derived from the derivatives of the log-density, should be harder 
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to estimate than the sufficient statistics. 


Also it should be noted that the assumptions underlying the two results are quite different. 
The MLE involves conditions on the covariance of the sufficient statistics cov p * [0(A)], while 
the score matching estimator requires conditions on T* = E p *[A(X)]. An interesting stream 
of future work would be to better understand the relationship between these two approaches 
and their assumptions. 


4.6 Algorithms 


In this section we consider algorithms for solving (4.30). In our experiments we denote our 


method QUASR, for Quadratic Scoring and Regularization. There are variety of generic 
approaches to solving problems which may be cast as the sum of a smooth convex function 
plus a sparsity-inducing norm |Bach et al. 2011], as well as generic solvers for solving second- 
order cone programs. Here we will propose two novel algorithms which exploit the unique 
structure of the problem at hand. First we will consider an ADMM algorithm; for a detailed 


exposition of this approach, see Boyd et ah, 2011 . In section 4.6.2, we consider a coordinate- 


wise descent algorithm for Gaussian score matching Friedman et al. 2007 


4-6.1 Consensus ADMM 


The idea behind ADMM is that the problem (4.30) can be equivalently written as 


min 

0,z 


l E (GG,>+GAi) + a E 

ieV i,jeV.i<j 



(4.76) 
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subject to the constraint that O^j = Oji = z; t j. The scaled augmented Lagrangian for this 
problem is given by 


L(9,y,z) = - 

Zj 

E (dMi+dAA 

(4.77) 


ieV V 


+ 

'y ] f II z ij II2 + ~~ z ij ) T ~ z ij) 

(4.78) 


i.jeV:i<j 



+ ^||@ij - z ij\\ 2 + ||^j* - z ij \ , 

(4.79) 


here {y} := {iJij, Vji} are dual variables, and p is a penalty parameter which we choose 
to be 1 for simplicity. The idea behind ADMM is to iteratively optimize L over the 0, y, z 
variables in turn. In the first step, since 0{j = 0j t is included as a constraint and may be 
considered separately, L as a function of 6 decouples into d independent quadratic programs, 
one for each ’’column” of 0, which may be solved in parallel. In the second step, z^j pools 
the estimates 0{j and dji from the previous step, and applies a group shrinkage operator. 
The third step is a simple update of the dual variables. 

Due to parallel updating of 0 in step (a) and subsequent averaging in step (b), this is 
known as consensus ADMM. At convergence, the constraints 9jj = Ojj = z^j are binding. In 
practice, we stop when the average change in parameters is small: 

E - sirbii/ E ii^’iii < 1(r4 - (« 4 ) 

i,jeV ijev 

In addition to parallelizing the update (a), other speedups are possible. For example, we 
may compute the eigenvalues A and eigenvectors Q of Tp which may be computed directly 
from the data matrix ... a ? ;(X n )] T using the singular value decomposition (Q being 

the right singular vectors, and y/nA being the squared singular values of the data matrix). 
We may then cache the matrix Q(diag(A + p) -1 )Q T , which is equivalent to (Tj + up 

to numerical error. This can be computed for each i G V, also in parallel, and only needs 
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to be computed once (even if estimating over a sequence of As). When optimizing over a 
path of truncation parameters mi, m 2 , one may utilize block matrix inversion formulas and 
the Woodbury matrix identity to avoid computing the inverse from scratch each time. I 11 
particular, let Yj be the current matrix of statistics, and Y™ ew be the statistic with a higher 
degree of basis expansion. Then Y^ ew has the form for some b, C, 


T,new _ 

1 i 


h b \ 

v r c, 


(4.85) 


The inverse takes the form 


/ 


(r 7 j iew + pi) 1 = 


-1. 


C + pI) b J L (C + pi-b' (T i + pi) b 


\ 


- 1 . 


-1 


(4.86) 


where 


L := ( T, + pi — b T (^67 + piJ b 

-1 \-l 


- 1 . 


-1 


= r , + pi 


T i + pl) b T (d + pl-b j (Ti + pl) b) b[Y l + pI 


-1, 


-1 


(4.87) 
-1 

(4.88) 


If the dimension of C is small relative to that of Tj, (r” ew + pi I can be computed 


-1 


-1 


quickly using the cached (T^ + pij , without the need for any additional large matrix 


inversions. 


4-6.2 Coordinate-wise Descent 


In this section we consider a coordinate-wise descent algorithm for the Gaussian score match¬ 


ing problem (4.38). Coordinate-wise descent algorithms are known to be state-of-the-art for 
many statistical problems such as the lasso and group lasso [Friedman et ah, 20071 and glasso 
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1. Initialize fl = /^; 

2. For i=l,2,... ,d,l,2,..., until convergence: 

(a) for j=i,... ,d: 

IT G i nT 


Qij <- S 


and set Vtji = Q, t j. 


d - E jj 


,A , 


(4.92) 


Figure 4.2: Gaussian QUASR Coordinate-wise descent 


for sparse Gaussian MLE Friedman et al. 2008 . Regularized score matching in the Gaus¬ 


sian case admits a particularly simple coordinate update. Consider the stationary condition 


for Q in (4.38): 


- ( HE + EG ) — I d + Z = 0, 


(4.89) 


where Z is an element of the subdifferential <9||G||i: 


Z ij G 


{®ij '■ \\0ijh < 1}, if Ill'll = 0; 


ft 




Pijh ’ 


if ll^yll 7^ 0. 


in particular, the stationary condition for a particular is 


(4.90) 


9 ) ■*■{* — ■?'} + Z ij ~ 0- 


*T, 


(4.91) 


Consider updating fly using equation (4.91), solving for and holding the other elements 
of Q fixed. After some manipulation, we get a fixed point for fly- is given by (4.92). We 


cycle through the entries of fl, applying this update, and repeat until convergence. 

Here S(x, A) is the soft thresholding function S(x, A) := max{|x| — A, 0}sign(x), and 
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\i := {1,..., i — 1, i + 1,..., d}. Each update only requires two sparse inner products and 
a soft thresholding operation. As such, in our experiments this algorithm converges very 
quickly, sometimes much faster than glasso for the same set of data. 


4-6.3 Choosing Tuning Parameters 

As of yet we have not discussed how to practically choose the regularization parameter A 
and for nonparametric score matching, the truncation parameters mi, m2- We suppose the 
existence of a held-out tuning set; in the absence, one may use cross-validation. If the 
likelihood is available, for example if fitting Gaussian score matching, or for a fixed graph 
which is a tree, we minimize the negative log-likelihood risk in the held out set. In the 
absence of the likelihood, we choose the tuning parameters to minimize the Hyvarinen score 
of the held out set. For a discussion on using scoring rules as a replacement for the likelihood 
in model selection and using score differences as surrogates for Bayes factors, see |Dawid and 


Musio, 2014 


To save on computation, we use the idea of warm starts which we detail in the sequel. 
First, observe that the first-order necessary conditions for regularized score matching are: 


T9 + K + Z = 0, 


(4.93) 


where Z denotes the sub gradient of the regularizer 7Z , at 9 , which is 


so 9 = 0 when 


Zij — 


{x:||x||<l}, ||0*j||=O, 


\\ 0 , 




IJ | 


o/w. 


A > max || K, 

ij 


ij\ 


(4.94) 


(4.95) 


This allows us to choose an upper bound A start such that the solution will be the zero vector. 
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The idea behind warm starting is the following: we begin with estimating 9\ = 0. 

Then we fit our model on a path of A decreasing from A start, initializing each new problem 
with the previous solution 9\. The solution path for the regularized MLE is smooth as a 
function of A, suggesting nearby choices of A will provide values of 6 which are close to one 
another. 

We can also incorporate warm-starting in choosing mi, m2- For a given A, we first es¬ 
timate the model for first-order polynomials, corresponding to mi = m 2 = 1. We then 
increment the truncation parameters by increasing the degree of the polynomial of he suffi¬ 
cient statistics. We augment the previous parameter estimate vector with zeros in the place 


of the added parameters, and warm start ISTA from this vector. See Section 4.6.1 for other 
computation savings when augmenting the sufficient statistics when choosing mi, m 2 . 


4.7 Experiments 


4-7.1 Gaussian Score Matching 

We begin by studying Gaussian score matching, and comparing to the regularized Gaussian 


MLE, using the glasso package in R Friedman et ah, 2008 . We consider experiments with 


two graph structures: in the first, a tree is generated randomly; this has d — 1 edges. In the 
second, a graph is generated where an edge occurs between node i and j with probability 
0.1, denoted the Erdos-Renyi graph. This graph has expected number of edges 0.05 -d(d— 1). 
The data is scaled to have unit variance and mean zero. 


Regularization Paths 


Figures El and |4.4| display regularization paths for one run of these simulations, where 
d = 100 and n is either 100 or 500. Relevant variables are plotted in black. For n = 500, 
it appears that the score matching estimator does a better job screening out irrelevant 
variables for both graph types. For n = 100 they perform similarly. The score matching 
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estimator tends to produce nonzero parameter estimates which are larger in magnitude than 
the regularized MLE, which is more pronounced for n < d. 


Risk Paths 


Figures 4.5 and 4.6 show risk paths under the two graph structures; figure two has d = 150, 
with 149 included edges; figure four has d = 100, with 499 edges included. We choose 
n = 100, and calculate the negative log likelihood risk using a held-out dataset of size n. 
The plotted curves are an average of 25 simulations from the same distribution. In figure 
4.5[ we see the score matching estimator selects a sparser graph than the regularized MLE; 


furthermore, the score matching produces an estimator with smaller held-out risk. For the 
Erdos-Renyi simulation, the score matching estimator also selects a sparser graph, though 
it has risk slightly worse than the MLE. 


These findings are also validated in Tables 4T and T2 varying d. For the tree graph, 
score matching dominates in risk for all values of d. Even for the Erdos-Renyi graph, score 
matching outperforms the regularized MLE when d = 30. Standard errors of 25 repetitions 
are in parentheses. 



quasr 

glasso 

d= 30 

28.082 (0.341) 

28.189 (0.319) 

d= 75 

72.785 (0.472) 

73.202 (0.432) 

d= 120 

116.771 (0.578) 

117.631 (0.542) 

d= 150 

147.26 (0.619) 

148.381 (0.583) 

Table 4.1 

: Held-out NLL 

error, tree graph 


quasr 

glasso 

d= 30 

26.157 (0.428) 

26.244 (0.386) 

d= 50 

44.351 (0.688) 

44.262 (0.718) 

d= 100 

91.383 (1.009) 

90.351 (1.077) 

d= 150 

139.03 (1.15) 

136.793 (1.336) 


Table 4.2: Held-out NLL error, ER graph 
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Figure 4.3: Regularization path, tree graph. d=100. Top: n=100. Bottom: n=500. 
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Figure 4.4: Regularization path, Erdos-Renyi graph graph. d=100. Top: n=100. 
n=500. 


Bottom: 
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Risk path, d= 150 



Figure 4.5: Risk path, tree graph 


Risk path, d= 100 



Figure 4.6: Risk path, ER graph 
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ROC curve, d= 150 



Figure 4.7: ROC curve, tree graph. Solid line: Gaussian QUASR; dotted line: glasso. 


ROC Curves and Edge Selection 


Figures |4.7| and |4.8| show ROC curves under the same simulation setup in the previous 
section. The plotted points represent the graph selected from the minimal held-out risk 
in each of the 25 repetitions. The two estimators display very similar ROC curves, and 
the score matching estimator tends to prefer higher sensitivity for lower specificity, when 
selecting using held-out data. 


Tables |4.3| and |4.4| displays true positive and true negative rates for varing choices of d, 
fixing n = 100. The parentheses are the standard deviation for 25 repetitions of the experi¬ 
ment. As are suggested by the ROC curves, score matching prefers a higher sensitivity and 
lower specificity to the regularized MLE, and for simulations when d is relatively small com¬ 
pared to n, score matching has a significantly higher true positive rate with only negligible 
reduction in true negative rate. 
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ROC curve, d= 100 



Figure 4.8: ROC curve, Erdos-Renyi graph. Solid line: Gaussian QUASR; dotted line: 
glasso. 

Computation 


In Figure [49] we compare the runtime of our algorithm with glasso. We simulate random 
Gaussian tree data with n = 100, d = 50. We fit over a path of As and plot runtime against 
number of selected edges. More regularization results in sparser graphs, and so convergence 
is faster. In this experiment our method is much faster than glasso, sometimes by a factor 
of 4 or more. The gap narrows for sparse estimated graphs. This is because while our 
algorithm is written efficiently in C + +, it doesn’t (yet) utilize sparse matrix libraries, while 
glasso does. Since our coordinate-wise descent algorithm involves sparse inner products, we 
believe our runtimes can be improved in the sparse regime. 
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CPU time (seconds (log)) 



Figure 4.9: Runtime, Gaussian QUASR and glasso. Gaussian data, n=100, d=50 
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quasi' 

glasso 

d=30 %TN 

0.941 (0.043) 

0.975 (0.025) 

d=30 %TP 

0.778 (0.051) 

0.674 (0.042) 

d=75 %TN 

0.821 (0.037) 

0.876 (0.042) 

d=75 %TP 

0.892 (0.017) 

0.832 (0.017) 

d=120 %TN 

0.751 (0.052) 

0.816 (0.044) 

d=120 %TP 

0.922 (0.01) 

0.874 (0.009) 

d=150 %TN 

0.699 (0.049) 

0.776 (0.052) 

d=150 %TP 

0.936 (0.006) 

0.899 (0.012) 

Table 4.3: Edge selection accuracy, tree graph. 


quasr 

glasso 

d=30 %TN 

0.969 (0.028) 

0.986 (0.02) 

d=30 %TP 

0.743 (0.036) 

0.612 (0.039) 

d=50 %TN 

0.878 (0.028) 

0.927 (0.025) 

d=50 %TP 

0.785 (0.027) 

0.673 (0.032) 

d=100 %TN 

0.633 (0.029) 

0.754 (0.02) 

d=100 %TP 

0.858 (0.013) 

0.753 (0.013) 

d=150 %TN 

0.486 (0.019) 

0.634 (0.018) 

d=150 %TP 

0.892 (0.01) 

0.791 (0.011) 


Table 4.4: Edge selection accuracy, ER graph 


4-7.2 Nonparametric Score Matching 
Risk and Density Estimation 

In this section we compare score matching and MLE when the sufficient statistics are chosen 
to be Legendre polynomials. For each choice of n, we simulate non-Gaussian data whose 
density factorizes as a tree. We do this by marginally applying the transformation y = 
sign(x — 0.5) | x — 0.5 | !l ' 6 /5 + 0.5 to Gaussian data which has a tree factorization, which 
has been scaled to have means 0.5 and covariance 1/8 2 , so it fits in the unit cube; the 
resulting data follows a Gaussian copula distribution, which is also a pairwise distribution. 
We train the model using both regularized MLE and score matching under constraint that 
it factorizes according to the given tree, and we estimate along a path of As and choose the 
regularization parameter A to minimize the held-out risk. Since the density has a (known) 
tree factorization, it is possible to compute the likelihood (and hence the MLE) exactly using 



functional message passing (Chapter 3). It is also possible to compute the marginals using 
the same message passing algorithm. 


Figure 4.10 displays the held-out risk for both methods, with sample size varying from 
24 to 5000 and d = 20. We average over 5 replications of the experiment. The MLE 
outperforms the score matching estimator, but the score matching estimators performance 
greatly improves relative to the MLE as n increases. 


Figure 4.11 displays contours from one bivariate marginal from the aforementioned sim¬ 
ulation. The top row shows the MLE and score matching estimator for n = 24, and the 
bottom for n = 182. For n = 24, the score matching marginal can make out much of the 
distinguishing features of the density such as the multiple modes, but isn’t as informative 
as the MLE. At n = 182 the marginals appear almost identical. We conclude that while 
MLE is more efficient as may be expected, score matching performs quite well, especially 
with larger sample sizes. Furthermore, we emphasize that at this cost of statistical efficiency, 
score matching can be computed easily under any graph structure, even when the likelihood 
is not tractable, while MLE is typically not tractable and must be approximated. 


ROC Curves 


Figures 4.12 and 4.13 display ROC curves from four experiments. We simulate data with 
d = 20 and n either 30 or 100. In the first two experiments, the data is Gaussian; in the last 
two, it is non-Gaussian (copula). The graph is either a random spanning tree with d— 1 edges, 
or a graph with each possible edge having inclusion probability 0.2, or expected number of 
edges d(d — 1) * .1. The ROC curves trace the true positive and true negative rates, varying 
the value of A. The curves are averaged over 10 repetitions of the experiment (with the data 
i.i.d. from the same distribution). We choose mi, m 2 to maximize max^ TP(\)+TN(\). We 
compare our method to the SKEPTIC estimator from Liu et ah, 2012a], which was designed 
in particular for model selection for copula graphical models, and the TRW estimator from 
Chapter 3. Our experiments show our method to do either just as well, or only slightly worse 
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Held-out neg log lik 



Figure 4.10: Held-out negative log likelihood for different training sizes. Red: regularized 
MLE; Blue dashed: score matching. Data generated from a non-Gaussian tree distribution. 














than competing methods. 


4.8 Discussion 

This chapter introduces a new approach to estimating pairwise, continuous, exponential fam¬ 
ily graphical models. Since the normalizing constant for these models is usually intractable, 
we propose a new scoring rule which obviates the need for it. Our resulting estimator may be 
expressed as a second-order cone program. We show consistency and edge selection results 
for this estimator, including as special cases a new method for precision matrix estimation, 
and for nonparametric edge selection with exponential series. We propose algorithms for 
solving the convex problem which are highly scalable and amenable to parallelization. This 
method has good experimental performance compared to other works in the literature. 
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Figure 4.12: ROC Curve, Gaussian data, n=30, d=20. Top: ER graph. Bottom: Tree graph. 
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Figure 4.13: ROC Curve, non-Gaussian data, n=100, d=20. Top: ER graph. Bottom: Tree 
graph. 
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CHAPTER 5 


CONCLUSIONS 


This thesis proposes a new framework for density estimation, inference and structure learn¬ 
ing for the nonparametric pairwise undirected graphical model. We consider approximating 
the log density using a truncated basis expansion, which results in a finite-dimensional ex¬ 
ponential family. We consider two estimation approaches. The first is regularized maximum 
likelihood, for which we provide a variational approximation method. The second is a new 
method for estimation and graph learning of exponential families based on the scoring rule 


of Hyvarinen 2005 . We show that score matching allows for provably consistent parameter 


estimation and structure learning for exponential families, despite exact inference for this 
class of densities being intractable. As a special case we derive a new method for sparse 
precision matrix estimation, which performs competitively in experiments. We also derive 
results for the exponential series approximation, and show its performance in experiments. 

This thesis contributes to two strains of literature. The first is that of pairwise density 
estimation Gu[ 2002 Jeon and Lin 2006 Liu et ah] 2011 . Our approach is novel in that we 
use exponential series estimators, rather than Mercer kernels or kernel density estimators. 
We also introduce a regularization approach to edge selection for learning sparse pairwise 
models. We show the regularized MLE estimator adapts to the unknown sparsity of the oracle 
pairwise density. In Chapter 3 we propose a variational approach to approximating the log- 
likelihood for nonparametric pairwise models; this gives an upper bound to the negative log- 
likelihood risk for an estimator and also gives a tractable algorithm for estimating pairwise 
models. 

The second contribution is to the literature on high-dimensional graph selection. The lit¬ 
erature focuses overwhelmingly on parametric models such as the Gaussian graphical model, 


the Ising model, or other parametric models with special structure Yang et ah, 2012 . We 


contribute to this literature in two ways. Firstly, our QUASR estimator works for exponen¬ 
tial families broadly. It allows to both have consistent parameter estimation as well as edge 

93 

















selection, even for exponential families which are non-normalizable. Second, applying the 
QUASR method to exponential series gives a method for model selection for the nonpara- 
metric pairwise graphical model. There has been recent effort to expand from parametric 


models to nonparametric or semiparametric graphical models, such as Gaussian copulas Liu 


et ah, 2012a| and forests Liu et ah, 2011 . Our approach is the first to address learning the 


fully nonparametric pairwise model and to demonstrate a tractable method with statistical 
guarantees for model selection. In addition to our theoretical and methodological contri¬ 
butions, we show the TRW and QUASR estimators perform well in practice, particularly 
compared to previous methods, and our algorithms are scalable and can take advantage of 
parallelization. 

There still remains more work to better understand the score matching estimator and 
its connection to the widely-used maximum likelihood estimator. While our experiments 
are a start, a more detailed study comparing the assumptions of the theoretical results is 
warranted. It would be valuable to conduct a parallel theoretical analysis like what was done 


for the Danzig selector and lasso in Bickel et ah, 2009 . Also, the theoretical guarantees 
for the score matching approach are generally weaker than for MLE; it would be interesting 
to find the optimal rates for estimation and sample complexity for edge selection in the 
computation-limited setting (i.e., restricting to algorithms, which can be computed up so 
some level of accuracy in polynomial time, which excludes MLE), and given that, End an 
algorithm which achieves those optimal rates, if indeed the score matching approach is sub- 
optimal. 

Our work is not the final word on the subject. There are other fruitful paths for deriv¬ 
ing good approximations or alternative estimators for exponential families with continuous 
potentials and for the nonparametric pairwise model, such as using semidehnite relaxations 


Lasserre, 2007 , or in the case of polynomial sufficient statistics, approximating the log den¬ 


sity as a sum-of-squares polynomial, which can be normalized efficiently. It would also be of 
interest whether proper scoring rules |Dawid and Musio, 2014] beyond the Hyvarinen score 
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would be useful for estimating these models. In Chapter 4 we show that the nonparametric 
pairwise estimator of |Jeon and Lin, 2006 optimizes the Bregman score , which is a proper 
scoring rule. Exploring these connections more may give rise to more provably consistent 
estimators or other sound approximations or variational approaches to these problems. Also, 
there is more work that needs to be done to better understand the behavior of variational 
techniques. We hope this thesis has proven the value of the exponential series approach 
to nonparametric estimation, and we hope our contributions are useful tools for analysis of 
complex, high-dimensional data sets. 
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APPENDIX A 


PROOFS FOR CHAPTER 2 


Proof of Lemma 2.5. 7. Observe that if \ n > 0, (2.3) may alternatively be written as 


argmin j — C{9) + XIZ{9) 1, (A.l) 

8eV:K*(0 e )<C l J 

for some C < oo. Thus the edge parameters of the solution are bounded. The only 
question is whether the objective approaches — oo as ||0 W || diverges to infinity. Examine the 
first order condition for 9 V : 


Lv — 


(A.2) 


For any hxed 9 e , the inverse of this equation produces a unique 6 so long as fi v belongs 
to the interior of the mean space, due to (|5]). For bases following the Haar condition, it 


has been shown that this is satisfied with probability one when n > m\ Crain, 1976 


Furthermore, due to the strong convexity of the objective ([3]), this solution will be unique 
when it exists. Si 


Let p = pz, where 9 is the solution to (2.3). Let 9 be any solution to the population 


mimmizer 


9 = argminKL(p \ pg). 
OeV(E) 


Csiszar 


1975 


(A.3) 


and is the density satisfying 


Pq is known as the information projection 
E Pq\4>e] = Ep[0_e]- Note that in particular, 9j^c = 0, so p^ and pg* have the same sparsity 
pattern. Indeed, since {0} is a minimal family, 9 is unique. For an information projection 
we have a Pythagorean theorem, which we present below. 


Lemma A.0.1. [Pythagorean Theorem for KL divergence] 
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Suppose pg belongs to the finite exponential family with sufficient statistics {(f)}. Then 

KL (p I p e ) = KL(p I Pq) + KL {p~ e I pg). (A.4) 

Proof. Notice that log (p/pg) = log (p/Pg) + \og(pg/pg). Taking expectations of both sides 
with respect to p gives 


KL(p | p e ) = KL (p | P q) + E p (log (pg/pg)) (A.5) 

From the moment-matching properties of the information projection, we have 


®pQog(pg/pg)) = E p -(log {Pg/pe)) = KL (pg | pg). (A.6) 


The result follows. □ 

This lemma implies that KL(p | p) can be split into two terms: the approximation error 
and estimation error. We start by analyzing the approximation error. 


A.l Approximation Error 


, Lemma 1] Let p, q be two densities with respect 


to the measure u. Then 


Lemma A.1.1. [[Barron and Sheu, 1991 


KL(p | q) < e ll lo g^/9-c||oo 


I p{\ogp/q 



(A.7) 


for any constant c. 

Proof. Using the Taylor expansion of e x , we get the bound 

2 

e x — 1 — x < e x+ , (A.8) 
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where x+ = max(0, x). Set h(x) = log p{x)/q{x) — c. Then 


l P^ogp/q = 
< 


< 


< 


j (plogp/q + qe c — p — c) — e c + 1 + c 

J p(h + e~ h ~ l) 



(A.9) 

(A.10) 

(A.ll) 

(A-12) 

□ 


Consider the linear space S m := S miyrri2 of functions spanned by the truncated basis 
elements {(j)}. Such functions need not be the logarithm of a valid density. 

Theorem A. 1.2. For f = logp let 


— ||/—/to||l 2 (p), (A.13) 

7 m = 11/ — /ro||oo) (A.14) 


be the L 2 (p) and Lqq approximation errors for some f m G S m . Then the information 
projection p„ satisfies 

KL(p | P g) < ^e 27m A 2 m . (A. 15) 

Thus, 'if 7 m bounded, 


KL(p | p ~ e ) = 0(A 


2 ) 

m) 


(A.i6) 


Proof. Let f m = ( 6 *, 0) +9q be the approximation of / presumed to satisfy the given bounds 
on the error f—fm■ Define a density p^ as logp^* = (0*, (jj) — Z{6*) and denote /^* = logp^*. 


Using the bound from A. 1.1 and setting c = 8q + Z(6*), we have that 
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eWf-fo-Oo-mWoo r _ x2 

KL(p | p*) < --- J P (/ - fg - 05 - Z(0*)) 2 (A.17) 

g2||/—/m||oo „ 

^ 2 11/-/m||x, 2 (p) (A.18) 

e 2 7m 

= — A m- (AA9) 

Since 0 achieves the minimum KL-risk, we have 

KL (p | Pq) < KL(p | p^) < A ?n- (A.20) 

□ 

Lemma A.1.3. The L 2 G 9 ) and Lqq approximations to / = logp satisfy 


A m = O 2r + |A| m 2 2r ^ , (A.21) 

7m = O (dm" r+a+1/2 + |£| m" r+2a+1/2 ) , (A.22) 

and the truncated variables 6^ satisfy 

II^tIIoo = o(max{m 1 r 1//2 ,m 2 r 1//2 }). (A.23) 

Proof. By the boundedness of p t j and Parceval’s identity, 

& 2 m = Jp(e bM 2 (A.24) 

<~ej {Qt.^v) 2 (A.25) 

— ell^lll- (A.26) 
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Observe that 


11411 2 = E E (»?) 2 

ieO k>m\ 

+ E ( E («§) 2 + E («y ) 2 

(■ i,j)£E \k>l,l>m,2 k>rri2,l<m2 

Now, we have for each i e V, 


and also 


E (O 2 = E «b 2 * > *r 2r 

k>m\ k>m\ 


k^ u 2 r 1 m -2r 


- ( XI I '"1 

ik>mi 


< Cm 1 2r , 


X (^j ) 2 = X ( Oij) 2 k 2ri l 2r ik 2ri i 2r i 


k>l,l>rri2 


k>l,l>m,2 


< ( X (^g) 2 * 2 ^ 

\k>l,l>m2 


kl A 2 h 2r il 2r j m ^ 2r 0 


= 0(m 2 2r ). 


We similarly get a bound of 0(m 9 2r ) when summing over k > m 2 , l < m 2 , 
we have 


— ^(ll^rlli) — 2r + 1-^1 '^2 _J? )■ 


(A.27) 

(A.28) 

(A.29) 
(A.30) 

(A.31) 

(A.32) 
(A.33) 

Combining 

(A.34) 
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To analyze 7 m , observe that 


7m ~ | E E 
i£V k>m 1 

+ y | e (^j)^+ e (^ij) < pki 

(i,j)eE \k>l,l>m2 k>m2,l<rri2 

< V V |«f 6 (*) 

k>m\ 


E E | (fg)6(fe)6(/) 

(■ i,j)€E \k>l,l>m2 


E |(fg)6(fc)6(/) 

k>m2,l<iri2 


(A.35) 


(A.36) 


Then we have 


k>m\ 


E #,■!>(*) =0 e " 7 “ 


, k>rrii 


1/2 


< o 


= O (m 


E 

\k>rri\ 

—r+a+1/2 

T 


klT\2 


( e ^ 2<-, ’ +a) J 

\k>m\ ) 


\ X / 2 \ 

/ 


(A.37) 

(A.38) 
(A.39) 


Similarly, 


E |«§(>wko| = o e K 7 “ r 

k,l>m 2 \k>l,l>rri 2 


kl i„n lfj ^ 2 


<011 E ( d ij kni 

l k>l,l>iri2 


= O [ m j 


—rj+ 1/2+a 
2 


(A.40) 

1/2 / \ l/2\ 

^ A; 2(-r i +a) / 2(-r i +a) j 

yk>l,l>rri2 ) j 

(A.41) 
(A.42) 
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The result is similar summing over k > m 2 , l < m 2 ■ Combining, we get 


7 m = O [dm l r+a+1 / 2 _|_ \e\ m 


—r+a+1/2^ 


(A.43) 


Finally, the series ^^ >1 (^) 2 fc 2r converges if and only if ( 9 ^) 2 = o{k 2r 1 ). Similarly, 
X)^>l(0^-) 2 fc 2r converges if and only if (6^) 2 = o(/c _2r_1 ), and similarly when summing 


over Z. It follows that 


II^tIIoo = o(max{m 1 r 1 ^ 2 ,m 2 r 1//2 }). (A.44) 

a 

A.2 Estimation Error 

Consider the Taylor expansion of Z{9 ) up to order 2, noting that VZ(9) = E q[4>\ = fi(9) and 
V 2 Z(9)= cov Pe $\: 


Z[9 + 5) — Z(9) + {5,^l(9)) + -<5 t ( cov Pd+z5 [(p \) h, 


(A.45) 


for some z e [0,1]. See for example Kakade et al., 2010 Portnoy, 1988 
Consider the function 


£(S) = C{9 + 5) - C{9) - A {R{9 + 6) - R(9 )). 

Note that £(0) = 0, and so 5 = 9 — 9 must satisfy E(5) > 0, if 9 exists. 

Lemma A. 2.1. Denote n* = E p [0] = E^[0]. Suppose X n > 1Z*(ji e — /j* e ), Wjj, v — /j, v \\ < 
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and A \^E < '^ 2r ^ Cs ) then the regularized MLE 9\ is unique and satisfies 


\\ 9 V ~ 6 v\\ < 


2g s 


*"u h'vlb 


ll«e-»ell < —XnVWl 
4 cs 


KL (Pfj | Pn)< 


9 1 - 2c 5 


^ — h*)|| 2 + \E 


Proof. Using the cumulant expansion of Z, we have for some z G [0,1], 


(A.46) 
(A.47) 
(A.48) 


C{9 + 5) - C(9) = (5, Jj) + Z(9) - Z(9 + 5) (A.49) 

= (S, P ~ p) ~ ( co % +25 [ 0 ]) & (A.50) 

Consider the set S := {0 : || 9 — 0*|| < r^} for a constant r^. S is a convex and compact 
set; since </> is minimal, V 2 Z(9) = covg[(j)] > 0 ([3]) for all 9 G S; thus — C is strongly convex 
over S with <5 T (cov0[(/>])<5 > cs||5|| 2 for a constant eg. 

By the (generalized) Cauchy-Schwarz inequality, 


\(Pe A*ej <^e) | A 
< 


77(<5e)77 (/ie He) 

f (mp). 


(A-51) 
(A.52) 
(A.53) 


the last line following because 77 is decomposable with respect to V. Further, from 


Negahban et al. 2012 Lemma 3, because 77 is decomposable, it holds that 


77(0 + 5) - 77(0) > K(6p ± ) - K(6p). 


(A.54) 
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Combining (B.17) and (B.16), 


(6e,fie ~He) ~ ^ (»(« + *) - «(»)) < (A.55) 


Using the subspace compatibility constant we have that 


K(<%) < vUi||£p|| < v'ilfHiell. 


(A.56) 


Additionally, the second term in (B.19) is negative and can be ignored. From Cauchy- 
Schwarz we also have 


{$vi fJ'v) A || fJ*v /^dIHI^i; 


(A.57) 


Now, consider the set 


C= U: H^ll < 


2c S 


MM\ < 


4c 5 


(A.58) 


denote the boundary of C by dC and its interior by intC. Note that 0 E C. If E(S) < 0 for 
each 5 G dS , noting that £1(0) = 0, by the convexity of £ and the fact that C is a compact 
set, it must be that S G intC. Suppose that 11 Jl v — /i* v \\ < 2 y|‘ s and |Ay/f ~E\ < 2/ y|‘ s , so 
that for 5 G C, ||5|| < rg, so that for any z G [0, 1], 0 + zS G S. Using ( |A.50[ ) and (B.19) we 
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obtain for 5 e dC, 


£{S) = C(0 + 5) - C{9) - A (11(6 + 6) - 

< {Sv, %) - t4) + (^e, V'e - A*e) ~ Lsll^ll 2 “ A 


+ «)“ 




-rfll + ^^11411 -c £ ||^|| 2 


— Il^ull (ll/b; /A>ll c <sll ( MI) + ll^ell 


f 3Av^ 

v“ 


c «sll^e|| 


< 0 . 


The claim is verified. 
Furthermore, since 


(A.59) 
(A.60) 

(A.61) 

(A.62) 

(A.63) 


KL [p~ 9 | p d ) = (9- If) - Z(0) + Z{9), 

(A.64) 

note that 


E{6) = (6, ju) - Z(0 + 5) + Z(0) - A (71(0 + 6) - 77(0)) 

(A.65) 

= -KL (p~ Q | p~) + {8,(2- n*) - A(77(0 + 5) - K(0*)). 

(A.66) 

because £{8) > 0, 


KL(pj | pg) < (6,p- p*> - A (k(0 + 6)- R(0)) 

(A.67) 

<iiAiiiiP«-p;n + :i KAl|i? e |i 

(A.68) 

<4(ii?»-kii 2 + 9A /)' 

(A.69) 


□ 
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Lemma A.2.2. Suppose that the univariate sufficient statistics satisfy \ (f>k \< b(k ) for all 
k and b(k) is increasing in k. For any t > 0, 


nwn, - tivf > t) < ^ 

nt 

P(7 Z*(ft e — /Xg) > t) < 2 exp 

P(||/x - /x*||oo > t) < 2 exp 


log (d 2 ml) 


nt 2 


8em2^(m2) 4 


log(dmi + d m 2 ) — 


nt 2 


8emax{h(m2) , fr( m l) / 


(A.70) 
(A.71) 

(A.72) 


Proof. Let (co Vp[f>]) v be the covariance matrix of the sufficient statistics 0 restricted to the 
vertex parameters. Then ((cov p [0]) v )^ 1 / 2 (/i z ,— /i v ) is a vector of dm\ mean zero, uncorrelated 
random variables each with variance 1 jn. It follows from Markov’s inequality that 


P ((//„ - fi v ) T ((cov p [(j)]) v ) l (p v - n v ) >tj < -(A.73) 

From the boundedness of the marginals pjj and the orthonormality of {0}, e 1 11A11 2 < 
<5 T (covp[0]) _1 h for any <5. It follows that 

P(||/i„-,i„|| 2 >«)<^i. (A.74) 

nt 

Now, to bound 77*(/7 e — /i*) we will need to find an exponential concentration for each 
Wfilj — pij ||. Suppose that the univariate sufficient statistic is bounded: | 0& |< b{k) for all 
k. Applying Hoeffding’s inequality, we get for each indices k, /, 


p ( e 1/2 1 fiij ~ f4j \ >*)< 2ex P 


t2 X 

8 b(k)‘ 2 b(l)‘ 2 j ’ 


(A.75) 


and so 


106 



(A.76) 


p (e 1 \\fiij ~ IPjf > m^(m 2 ) 4 t 2 ) < P j (J {e 1/2 | - nfj 

\k,l 

^ f nt 

L ex p(-^ 

7. 7 1 V- 


> 


k,l=l 

2 f nt 1 
< 2m 2 exp 


Applying the union bound once more, 


P(77* (jl e — /ig) > t 2 e) < 2exp log(d z m 2 ) 


2 _ 2 > 




8m2&(m 2 ) 4 ' 


Proof of Theorem 2.5 . 4 From theorem |A.1.2[ we have the approximation error 


KL(p p„) = 0 (e 2l ™A 2 m ) , 


From IA.1.31 we know that 


(A. 77) 
(A.78) 


(A.79) 


□ 


(A.80) 


Af n = 0(dm7 2r + \E\~ 2r ), 


(A.81) 


and 


7m = 0(dm7 r+a+1/2 + \E\ mf r+a+1/2 ). 


(A.82) 


To keep e 7m bounded, we require 


107 








1 


mi = Q(d r - a ~ 1 / 2 ), 

(A.83) 

i 

m 2 = nd^i*-®- 1 / 2 ). 

(A.84) 


From lemma A.2.1 the estimation error satisfies, for \ n > lZ*(jl e — /ig), 


KL (pg | p ? ) = 0 (Xl |B| + ||ft, - 


(A.85) 


Applying lemma A. 2 . 2, lZ*(jl e — fi *) > \ n with probability no more than 


2i2\ 


2 exp < log(m 2 <i 


K n 


8em2&(m2) 4 J 


(A.86) 


choosing X n = —— ' {md.) , ^ ie probability j s bounded by — • Also, 


rn^cP 


rtlll > ^ 


(A.87) 


with probability no more than j. Thus applying lemma 
we have 


A. 0.1 


under the stated conditions 


KL (P I Pg) = KL(p | pg) + KL(p^ | p~) 


= O p I dm l 2r + \E\ m 2 zr + 


_ 2r m^ 6 (m 2) 4 log(nd) 


n 


E\ + ^d 

n 


(A. 88 ) 
(A.89) 

□ 
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A.3 Model Selection 


Taylor expansion in Hilbert Space 

For any / = (9, 0) e W 2 , ||0||2 = ||/||l 2 < oo. Thus the set of such 9 belong to a Hilbert 
space. Applying the Taylor expansion with remainder for Hilbert spaces, for ||J|| < oo we 
have 


Z{9 + 8) = Z(9) + DZ{9) ■ 8 + -D 2 Z{8) ■ 8 2 + -D 3 (9 + z8 ) • J 3 . (A.90) 

2 6 


here z is some point in [0,1], and D^Z is the kth Frechet derivative of Z represented 
as a multilinear map, -8^ denotes evaluation at (8, Now, consider 9*, 8 which can 

be decomposed to a truncated vector and a remainder, 9 * = (9*,9^,) T and 8 = (8, <hp) T 
and write b = (4>,4>t)- Taking the derivative of Z with respect to 9*, and inputting the 
particular expression for the terms of the Taylor series (see [Portnoy, 1988j), we get the 
Taylor expansion for /x. 


fi(9* + 8)- n(9*) = covq* [<f>, <t>]8 + xE 0 * +Z( 5[(5, b) 2 b] 


2 % 


(A.91) 
(A.92) 

((M-%[</>]) + (5t,0T - %[0t])) 2 ' (</>-%[</>]) 


covgi* [b, cp]8 + covq* [b, br] ■ $t 

l 


R 


Here #1 = 9* + z8 for some 0 e [0,1]. We denote the final term the remainder , R := R(8). 


Primal Dual Witness 

Our proof technique is the primal dual witness method, used previously in analysis of model 

selection for graphical models Jason D. Lee[ 2014; Ravikumar et al., 2011] , It proceeds as 

follows: construct a primal-dual pair (9,Z) which satisfies supp($) = supp(0*), and also 
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satisfies the stationary conditions for 2.3 with high probability, 
for 12.31 are 


The stationary conditions 


H — n(6) + A n Z — 0, 
where Z is an element of the subdifferential dlZ(6 e ): 


(A.93) 


{Oij : \\Qijh < 1}) if 0*j = 0; 


(3ft(0 e ))y = 



if 9 t j ^ 0; 


0, if i = j- 


(A.94) 


From this we may conclude that the solution to 2.3 is sparsistent. In particular, we have the 
following steps: 


1. Set 9jt;c = 0; 

2. Set Zij = dntfDij = ^|y for (i,j) e E- 

3. Given these choices for 9^c and Zg, choose 9e and Zgc to satisfy the stationary 
condition 12.31 


2.3 


For our procedure to succeed, we must show this primal-dual pair ( 9 , Z) is optimal for 
, in other words 


9ij ^ 0 , 

for 

(i,J) e E; 

(A.95) 

11 11 ^ 1 1 

for 

{hj)?E. 

(A.96) 


In the sequel we show these two conditions hold with probability approaching one. 

We begin by proving a bound on the remainder. 
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Lemma A.3.1. The remainder (A.93) satisfies 


R= (||A £ ||L + II«tII~)%. 


Proof. By Cauchy-Schwarz, 


(4>E - &e) < W^eWooW^E - E 6/tbE]||L 


and similarly for 


((f>T ~ < 11 Oji 11 oo 110T — E 0t t^T ] 111 • 


So 


R < max{||A||oo, ll4ll»} 2 E[(|fe||l + Ml) 2 • E$)] 

= max{|| A||oo, Halloo } 2 K 

< (IIAeIIL + ll«rll~)^- 


Applying the norm gives 


Halloo < (Willie + ||^j’lloo) K l?max{6(m 2 ) 2 , b(mi)}. 


(A.97) 

(A.98) 

(A.99) 


(A.100) 

□ 
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Condition (A.95) 


Denote b : = max{ 6 (m 2 ) 2 , b(rrii)} and m := max{m 2 , 

Lemma A. 3.2. Let 

f := 2/crdlWsHoo + A n / m + ( K T + l)||^j i || 0 o)) (A.101) 

and suppose f < 2bK R K. r anc ^ II^tIIooAR^ — 1- Then 

\\e E -0* E \\oo<r. (A. 102 ) 

Proof. The stationary condition for 9^ is given by 

AE — v{6)e + ^uZe — 0- (A.103) 

Denote W := J1 — ft*. Then we may re-write 

We + fig — [J,(9 )e + Xn^E (A.104) 

— We — Fee^e — ^et^t ~ Re + An-^E, (A.105) 

where A e = Re-arranging and applying the L 2 norm, we get for (i,j) € E, 
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Ay || 2 = l|r-^(-w £ + r ET o* T + r e - x„z e )\\ 2 (a.ios) 

= Ilfy-clKII w £ ||2 + mVdTEWr^um^ + \\R E h + W\z E h) (A.107) 

< ,, 1 1111 I 1 + ttl-J d + | A) - ?L + ||i?£:||2 + Vd + EX n ). (A.108) 

m\J a + E 

— Kp (Halloo + «r||0rl|oo Halloo + ^ n/ m ) • (A.109) 


Consider the mapping 


.-1 

E> 

,-l 


f (^e) ■■= -r ee (We - t ee a e - T ev 6 e - R e (A e ) + A n Z E ) + a e 


= -r EE (w E - r et 0 t - r e + x n z E ). 


(A.110) 
(A.Ill) 


Due to the uniqueness of the solution to the stationary conditions, F(A E ) has a unique 
fixed point F{A E ) = A E at A = 9 — (9*). If we can show that ||F(A£;)|| 00 < r for every 
11 A e 11oo < f, since F is continuous and {A^; : UA^Uoq < f} is a convex and compact 
set, applying Brouwer’s fixed point theorem |Ortega and Rheinboldt, 2000 implies that the 
unique fixed point of F satisfies HA^Hoo < r. The L 2 norm of the map follows, for (i,j) € E , 


\\ F (A E )ij II2 < II r : AII2(II ^11 2 + "ilir^rllooll^rlloc + \\ R E h + K\\Z E h) (A. 112 ) 


u 


< Kp (|| || oo + KtII^tIIoo + WReWoo + A n/m) . 


(A.113) 


Let r := 2fi)p(||fTp;|| o0 + (k e + l)||6 l ^|| 0o + X n /m) and consider HA^Hoo < r for (i,j) £ E. 
Suppose that also r < 2 bn R n T ' 


From lemma A.3.1 the remainder is bounded by 
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(A.114) 


Ifelloo <bk R (\\A E \\l+ 119^1). 


If ll A £;||oo < 2bk R K r ’ and W°Th bhi R - C this is bounded by 


1 

2Kp 


A£;||oo + ||4IIoo, 


Thus, since ||A||oo < maxy||Fy|| 2 , 


ll^(A^)lloc A K r ( IIW^eHoo + («r + l)IICp||oo + ^n/ m + 


IA 


E II oo 


2/tp 



It follows that the fixed point A r = Oe — 0* E satisfies 


(A.115) 


(A.116) 
(A.117) 


IIAjp||oo A 2 kp(||W^||oo + ^n/ ? u + (kt + 1)||6 , ^|| (X) ). (A.118) 

□ 


Condition (A.96) 


Lemma A.3.3. Suppose that 


max 


{8Kpm/7 


),m(||Il / ||oo + (1 + k t)II^tII°°)} - - 


rm 


Halloo A 


mm 


r A r 


128 1 


m 


4(1 + K/p)m’ j 


(A.119) 
(A.120) 


Then max^j^^c ||^j||2 < 1- 
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Proof. Recall that d E c = 6* EC = 0. For (i,j) G E c , the stationary condition gives 


Wrc - r E c E((0m)E - Qe) - r E C V^T ~ R E C + X n Z E c = 0 (A.121) 


Now, re-arranging (A.105), 


o%-e E = r ee(w e + a n z E + r e + r ET e* T ). (A. 122 ) 


It follows that 


z E c = — j -w E c- r E c E r EE (w E + t E c V o e + r e + a n z E ) - t E c T o e - r E c j . 

(A.123) 


Now, for (i,j) G E c , 


Zijh < y {ll^ijlb + lir^r EE (w E + r v v + r e + x n z E )\\2 + Wi'ij^rh + \\Rij Ibj 

(A.124) 

< ^{ll^ll2 + l|ry, E r^|| 2 h||w E || 2 + mVd + s Kr || 4 i | 00 + ||B E ||2 (A.125) 

+ ^n||Z E || 2 ) + Il-Rijlk) 

< ^{ll^-||2+ll^-|| 2 +^=d(lM 2 (A.126) 

+ niVd + Ef''/) '!)■ |j>_ + l|17 E || 2 f + (1 — 

< f- {m(2 - i-)(||W||oo + Halloo)} (A.127) 

^n 

+ t—(1 - r)hc T rm \\S E \\oo + 1 - t. 
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From the assumptions of the lemma, and applying Lemma A.3.2, we get 


||A_e||cx) < 2ftp(||hFg'|| 00 + Ajx/m + ( kj < + 1)Halloo) 
< 4/s;pA n /m. 


From Lemma 


A.3.1 


if 16&/€ft/tpA n < ™ and bn,R ||^||oo < nr, 


llfllloo < + H4fc) 

< 16 &ftRft 2 A 2 /m 2 + mH^ylloo 

- + -Halloo, 


if (1 + fi/rMl^lloo < |A„, 


we may bound 11 Z^j \| 2 by: 


(A.128) 
(A.129) 


(A.130) 
(A.131) 
(A.132) 


\Zijh < ( 2 


T >i + 1 


T H '^^11^7’llco 

Ar 


m 

r r „ 3r 

< 1 _ T+ _ + _ = i __ <i , 


(A.133) 
(A.134) 


□ 


Proof of Theorem 2. 5.15} For the assumptions of Lemma |A.3.3| to hold, we need 


A„ = n(m(||W||oo + ||4llc»))- 


From Lemma A.2.2 


> t with probability no more than 


2 exp < 2 log (nd) — 


nt z 


8 e max{ 6 (m 2 ) 4 , j ’ 


(A.135) 


(A.136) 
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and the truncation parameters satisfy 


||0y||oo = 0(max{m 1 r 1 ^ 2 ,m 2 ? 1//2 }). 


(A.137) 


Supposing that m 2 


mi, (A. 135) is satisfied with the choice 


A n x m 2 



+ m, 


—r—1/2 


(A.138) 


1 

with probability approaching one. Balancing the two terms, if we choose m 2 x n 2r + 1 + 4a , 
we get 


A r 


log(nd) 

2i—1 ~ 

fl 2r+l+4a 


(A.139) 


Lastly, for condtion 


A.95 


to hold, we need ma \\@ij 


n,\ 


00 


< p 

- ~T- 


Using lemma 


AT372 


this is satished if 


Ar 


m2P 


—)■ 0, 


(A.140) 


which is satished with high probability when 


1 

p* 


o 



2r+l \ 

77,2r+l+4a 

log (nd) 


(A.141) 


□ 
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APPENDIX B 


PROOFS FOR CHAPTER 4 


Proof of Proposition 4-2.2 


Consider the functional 


J(q) := E p 


(|| V logp — Vlogg) <E> x(l 



(B.l) 


If J(q) = 0, then it must be Vlogp = Vlogg a.e., because their integrated squared 
distance is zero with respect to a weight function which is nonzero a.e. This implies logg = 
logp + c a.e. for some constant c, but c = 0 because p and q must both integrate to one. 
Furthermore, J is non-negative so it is minimized when q — p. If p and q belong to an 
exponential family with respective natural parameters 9 and 9 ', 9 = 9' when the family is 
minimal. 

Now, 


J(q)=E p || V log g ® x(l - x )\\2 


2E ? ; 


( v * 1() g q ■ v * 1o sp) ® - x )\\l 


+ constant , 


_i£V 

the constant not depending on q. We have, by integration by parts, 


(B.2) 
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(B.3) 


Ep [(Vi loggV ?: logp)x»(l - Xi) 


J Pii^i log qVi \ogp)xi(l - Xi 

[ Pi~^~(Vi log Q) x i(l — Xi) 

J Pi 

Pi(x i) (Vi log qxi(l -Xi)) 

- Xi = 1 

- Pi(xi)(V i\ogqxi(l ~ Xi)) 


J Xn=0 


PiX/iiy,log <jxi(i - Xi)) 


- / P.Vj(Vjlog?a;i(l-if)), 


(B.4) 

(B.5) 


(B.6) 


where in the last line we applied the boundary assumption. Thus, we see that J(q) is 
equal to E p [h(X, g)] plus some terms which don’t depend on q , so from the argument above 
E p [h(X, q)] is minimized when p — q. We conclude that h is a proper scoring rule. 

□ 


B.l Parameter Estimation 

Lemma B.1.1. If n > Cmd and \ n > 272.* ((T — T)#* + K — K), with probability at least 
1 — 2d exp j — -jbymdj, the regularized score matching estimator 6 satisfies 


l|e-«*l| 2 <P^s/d+\E 


(B.7) 


Proof. Dehne the function 
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£(8) = C(6* +5)- C(6*) + A n{n{9* + 8) - 11(6*)) (B.8) 

= + ^) T r(0* + 5) + ( 6 * + 5) t K - ^(d*) T f (6*) J - ( 6*) J K (B.9) 

+ x n (n(d* + 8) - k{6*)) 

= ^5 t ?5 + <5 T (f e* + k) + a n (iz{6* + 8) - n(o*)) (b.io) 

= ^8 j fs + 5 J {?6* - T0* + K- K ) + A n (1Z(6* + 8) - 11(6*)). (B.ll) 


Since £(0) = 0, it must be that 8(5) < 0. 

Using the sub-Gaussian assumption, we may apply Vershynin 2010 Remark 5.51, which 
says for any c G (0, l),t > 1, with probability at least 1 — 2exp{— torrid}, if n > C(t/c)^md, 
then for any i G V, and any vector 8{ 


sj Pjij > sTr,* + «« 2 , 


(B.12) 


setting c = 



<J=, and t 


1/c we get that if n > Cd, with probability at least 1 — 


e 

2 


¥i 


2 


< 


sj fiSi. 


(B.13) 


Applying the union bound over all i G V, with probability at least 1 — 2d exp 



|<y T <s < <s t P<j, 


(B.14) 


where 5 = (5j,..., 5j ) T . 

By (generalized) Cauchy-Schwarz, 


120 









(<5, (r - T)6* + K - K) < K{$)ll*{{T - T)d* + K - K) 

<^ms P +K(s v± ). 


where 5 a denotes the projection of 5 onto the set A. From |Negahban et al., 2012 
3, because 7 Z is decomposable, it holds that 


n(6 * + 5) - n(9*) > n{5 P± ) - n(8p). 


Combining (B.16) and (B.17), 


{S, (f - r)0* + K- K) + y + S)<~ 

> -y%) - T ns ^) * 

Using the subspace compatibility constant we have that 

ns P )< */d+\E\¥p\\ < v / rf+T^n«ii. 

Thus conditioning on the aformentioned probability, 


£W>|W| 2 -yWI '/d + E 
= ll«ll (f M- 3 ~ y ^+ e ) ■ 


Now, consider the set 


(B.15) 

(B.16) 

Lemma 


(B.17) 


(B.18) 

(B.19) 


(B.20) 


(B.21) 

(B.22) 
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(B.23) 


C=\S: ||<5|| < 7 ^VdTE\ . 


C is a compact, convex set. Furthermore, for all S G dC, from (B.22) we see that 8(8) > 0. 
Also observe that 0 G intC. Since 8(8) < 0, it must follow that 8 G intC, in other words 


7A r 


\\0-0*\\ 2 < —VdTE. 


(B.24) 


□ 


Proof of Theorem f.5.5 . Applying a concentration bound to JiF — Kf- in addition to a union 


bound, we have that \\K — A"|| max > t with probability no more than exp{2 log(md) — c 2 nt 2 } 
for t < v, for constants c 2 , v > 0. Similarly, 


II(r — r)0 Umax A 2K0* ; il|r — r|| max , 


(B.25) 


and ||r — r11max > t with probability no more than exp{2 log(md) — c\nt^} for t < u 2 for 
ci, v 2 > 0. Furthermore, observe that for any vector 6 , 


TZ*{9) < y/^\\6\\ 


max) 


(B.26) 


mk ? „ log (md) 

thus setting \ n x y --, the conditions in Lemma 

probability approaching one. 


B.1.1 


will be satisfied with 

□ 
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B.2 Model Selection 


Our proof technique is the primal dual witness method, used previously in analysis of model 
selection for graphical models Jason D. Lee[ 2014; Ravikumar et al. 2011| . It proceeds as 
follows: construct a primal-dual pair (6, Z) which satisfies supp(6 l ) = supp(0*), and also 


satisfies the stationary conditions for 4.30 with high probability. The stationary conditions 
for 14.301 are 


T9 + K + X n Z = 0, 
where Z is an element of the subdifferential dlZ(9 e ): 


(B.27) 




| {Vij ■ Pijh < 1}, 

] djj II 


if Oij = 0; 
if Oij ^ 0. 


(B.28) 


From this we may conclude that there exists a solution to 4.30 is sparsistent. In particular, 
we have the following steps: 


1. Set 9pjc = 0; 

2. Set Zij = dn{0t) l3 = A for (i,j) e E- 

3. Given these choices for 9^c and choose 9^ and Z^c to satisfy the stationary 
condition 14.301 


For our procedure to succeed, we must show this primal-dual pair (0, Z ) is optimal for 


4.30, in other words 
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9ij ^ 0, 

for 

(i,j) e E-, 

(B.29) 

11% II < 1, 

for 

(i,j)?E. 

(B.30) 


In the sequel we show these two conditions hold with probability approaching one. 


Lemma B.2.1. Suppose that ||r_g_g; — r_g_g|| 


< 


1 


max ^ 2msn r 


. Then there exists a solution to 


(4.30), 9, satisfying 


II Qe - %||oo < 2^r ( 2 K 1;6 )||(f EE - Tee )Umax + II K - K Hoc + A n/y/rnj . (B.31) 

Proof. The stationary condition for 9e, observing that 9e c = 9 EC = 0 5 is given by 


r EE^E + k E + A hZe = o. 


(B.32) 


Re-arranging and observing that Tee@e = ~^E-> we have 


FeE^E + K E + A n Z = Tee^E - TeE^E + K E ~ K E + A(B.33) 

= (^ee - r ee)0 + T Ee(0e - 9%) + K E ~ K E + An%- (B.34) 


Consider the map 


F{A E ) = -T EE ({Tee ~ Eee){Ae + ®e) + ^ee^e + k e - k e + A ) + A E . 

(B.35) 
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F has a fixed point F(A E ) = A E at Ap; = 0 E — d* E for any solution 0. Define r : = 
2k t ^2K 1)0 ||(f ee - r^) 11 max + II I<E - K E \\oo + A n/V™) • If we can sllow ll-^( A )l|oo < f 
for each ||A||oo < r, from Brouwer’s fixed point theorem |Ortega and Rheinboldt], 2000 , it 
follows that some fixed point satisfies HAHoo < r. For UAUoq < r, 


\\ F ijh < , r = (II (Fee - t ee )(a e + q* e )\\2 + II k e - k e 11 2 + Anll^Elh) 

(B.36) 

< (||(r y> £? - r ij,E)(^E + ^lloo + II-Ae - A^lloo + A n/\pm s j • (B.37) 


Now, 


II (r EE - r^Alloo < < max ^2 

1 ieV . ~Tb 

j£V^k<m 


A k - 

ij 


l|r EE - ^ EE 11 max (B.38) 

< ms||A||oo||r_B_B - 11max, (B.39) 


and similarly, 


I \(Tee ~ r ee)0*e\\oo < 2 ^ 1^11 T E e ~ r ^ lloo - ( B . 40 ) 

Thus, if \\f EE - r^|| max < 

||F||oo < max IIF^I^ < Kp (llT^ - T EE \\ max (2Ki j Q + msr) + ||K - K\\oo + A n /y/m) 

(B.41) 

<^ + ^<r. (B.42) 

a 


125 













Lemma B.2.2. Suppose that y/m\\K — K ||oo < m 1 / 2 llP-r|| max (hiQ + Sy/mX n ) < Z ^ L 

and Xn/Vm > 2KY(rnsKg\\T EE - T ^£;|| m ax + \\K - K ||oo)- Then for each (i,j) E E c , 


\\Zijh < !• 

Proof. For (i,j) E E c , the stationary conditions are 


(B.43) 


0 = T E c E 9 E + K E c + X n Z E c (B.44) 

= T E c E (d E - 9 e ) + (f E c E - T E c E )9 E (B.45) 


+ K E c ~ K E c + X n Z E c , 


re-arranging and plugging in the stationary conditions for 9 El we have for (i, j) E E c , 



~ ^ij,E T E e(~(^EE ~ ^EeWe ~ ( K E ~ K E) ~ ^nZ E ) 



^-ij,E)^E K'ij T K; L j 


(B.46) 


Applying the L 2 norm, 


\\Zijh = "I (XEE - ^EE)^e\\2 + II K E ~ K e\\2 + ^nll^clb) (B.47) 

+ \ftij,E ~ ^ij,E^Eh + \\Kij - ^/jlbj 

< Y n {^™( 2 - T )(ll^ - K Woo + II(r - Halloo} + 1 - r. (B.48) 

Observe that since \\9 — #* || max < f, 
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(B.49) 

(B.50) 


II(r — r) 6 >||oo < ||r - r|| max (K 10 + msr) 

— I|r r|| max + 2sy/mX r ^j , 


so 11 Zjj 11 2 is bounded by 

y- {(2 - T)y/m\\K - K ||oo + y/m(2 - r)||f - T\\(ki a + 2sy/mX n )\ + 1 - r. (B.51) 
if y/m\\K - K ||oo < ^ and - r|| max (2K lj 0 + 2s^/mX n ) < I ^ L , this is bounded 


(2_T) (i + i) + I ' T - 1 -’i <1 - 


(B.52) 


Proof of Theorem f.5.£\ Using a concentration bound for K'X — K'f- and applying a 


ij 


a 


union 


bound, we have that when t < v\ for some is, \\K — K \| max > t with probability no more 
than exp{2 log(md) — C 2 'nt 2 } for a constant C 2 - Similarly, 


II(r — r)0 Umax < 2k\ fl||r — r|| max , 


(B.53) 


and for t < U 2 , ||T — r|| max > t with probability no more than exp{2 log (md) — cpui 2 }. 
Thus setting X n = C\J ^ for sufficiently large C, and if \J m — o(l), 


the assumptions of lemma |B.2.2| will be satisfied with probability approaching one. Further, 
assumption (B.29) is satisfied when ||0 — fUHoo < Since ||9 — 0*||oo = 0(X n / ydn), we 


require 


= o(l). 


□ 
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Lemma B.2.3. Let 0^. be the fcth orthonormal Legendre polynomial on [0,1]. then 


x(l — x) 
x(l — x) 


d(pk ( x ) 


dx 

d 2 (pk{x) 


dx 2 


0(/fc 3 / 2 ), 

(B.54) 

0(fc 5 /2). 

(B.55) 


Proof. From Bo nn et’s recursion formula Abramowitz and Stegun, 1964 , 


x 0- - = \ ( ( 2x - h-i( x ) 1 , 


so taking absolute values of each side, and using 10/, | < \/2k + 1, 

, d4k 


x(l — x) 


dx 


< k (\(j) k \ + \<fk —11) 
= O 


Now, using Legendre’s differential equation [Abramowitz and Stegun 1964 


2 

4x(l — x )—+ 2(2x — 1) — — k(k + 1)0/. = 0, 

dx z dx 


and using the fact that < k{k+l)^/2k+l , wg ft nc j that 


x(l — x) 


d 2 (f>k 


dx 2 


1 

< - 

dfk 

“ 2 

dx 


+ ~fk(k + 1 ) | 0/-1 


= O 


(B.56) 


(B.57) 

(B.58) 


(B.59) 


(B.60) 

(B.61) 


Proof of Theorem f.5.13. The proof technique is essentially the same as Theorem 4.5.9 


□ 


so 


we omit some details. The main difference is that here K* = — T*#* — so we must 

deal with one additional term in the analysis, the bias from truncation. Suppose = m 2 - 
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We choose X n so that with high probability, 


Xn = X} EE ~ r ee\\ max + m2\\K — K || max + «T m 2ll#:rl|max) , (B.62) 

An ->• 0, (B.63) 


as well as requiring n = h2(m 2 s 2 log md). 


Now, applying Lemma B.2.3, ||^4(a;) ||max = O (m 2 ), so applying Hoeffding’s inequality 


and a union bound as well as the boundedness assumption 


4.5.12 


EE 


EJsllmax 5; 


c 


m, log(m2d) 


with probability approaching one, for sufficiently large constant C. Similarly, 
\\K — K\\ma,x < CV — > f ,M) w ith probability approaching one. Furthermore, ||0j’|| m ax = 
0(m 2 ? 1//2 ). Thus, supposing q = 0(m 2 ), we need 


An x O 


'mi 2 log(m 2 <i) 


n 


+ m 


—r+1/2 


(B.64) 


Balancing the two terms, we choose m 2 x n 2r + 13 , so A r 


l°g(nd) . rpj ie stated sample 

yi ‘2 t —|— 13 ^ 

complexity ensures that X n —» 0. Furthermore, 1 9 e — #|;||max = O (A n /m 2 ), so we require 

Ar 


ni2P 


-> 0. 


□ 
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APPENDIX C 


SOFTWARE 

Software for Chapter 3 is available at https://github.com/geb5101h/trw. Software for Chap¬ 
ter 4 is available at https://github.com/geb5101h/quasr. 
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